Title: Constructing A Hybrid Model of ARIMA and SVR Algorithm with GDELT Digital News Dataset to Predict U.S. Dollar Index (建立ARIMA與SVR混合式模型,結合GDELT數位新聞資料集預測美元指數)
Author: Shen, Po-Yu (沈柏宇)
Contributors: Liao, Szu-Lang (廖四郎), advisor; Shen, Po-Yu (沈柏宇)
Keywords: GDELT project (GDELT專案); Hybrid model (混合式模型); U.S. dollar index (美元指數)
Date: 2018
Uploaded: 3-Jul-2018 17:27:00 (UTC+8)
Abstract:
News is an important source of information for fundamental analysis. This study asks how digital news data can assist or compensate for the price-forecasting ability of traditional econometric models. It draws on the GDELT Project, a large, public digital news dataset whose structured records are produced from a wide range of news sources by text mining and natural language processing. After the data preprocessing proposed in this study, the GDELT records serve as features for the big data analysis component of several hybrid models that combine an ARIMA model with that component. The models are used to predict the price behavior of the U.S. Dollar Index, and their performance is compared by mean squared error.
For the time series data, the study uses a two-layer rolling window analysis. Three test periods are used for evaluation: before the European debt crisis (2009/06/02~2009/11/30, 130 daily observations), while the crisis was spreading (2009/12/01~2010/12/01, 260 daily observations), and after the crisis (2017/01/02~2017/06/30, 130 daily observations). In the periods before and after the crisis, the hybrid models with GDELT features outperform the pure ARIMA model; during the crisis they do not. The study attributes this to a stronger link between market prices and finance-related news while a financial crisis is spreading: the GDELT dataset lacks finance-related news, so in that setting model performance is limited or even worse.
Because the data volume is large, data processing and computation rely on parallel computing on a cluster architecture. The study therefore rents cloud virtual machines from Google Cloud Platform and runs a Spark cluster (via PySpark) on them to carry out a near-real-time rolling window analysis.
Description: Master's thesis
National Chengchi University (國立政治大學)
Department of Money and Banking (金融學系)
105352034
Source: http://thesis.lib.nccu.edu.tw/record/#G0105352034
Type: thesis
dc.contributor.advisor: Liao, Szu-Lang (廖四郎)
dc.contributor.author: Shen, Po-Yu (沈柏宇)
dc.date.accessioned: 3-Jul-2018 17:27:00 (UTC+8)
dc.date.available: 3-Jul-2018 17:27:00 (UTC+8)
dc.identifier (Other Identifiers): G0105352034
dc.identifier.uri (URI): http://nccur.lib.nccu.edu.tw/handle/140.119/118242
dc.identifier.doi (DOI): 10.6814/THE.NCCU.MB.003.2018.F06
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
  Section 1. Research Motivation
  Section 2. Research Objectives
  Section 3. Research Results
Chapter 2. Introduction to the Research Datasets
  Section 1. The GDELT Project
  Section 2. The U.S. Dollar Index
Chapter 3. Related Work
  Section 1. Predicting Financial Market Prices from Quantified News Events
  Section 2. Case Studies of Hybrid Models
Chapter 4. Research Method and Architecture
  Section 1. Data Acquisition
  Section 2. Data Preprocessing
  Section 3. Rolling Window Analysis
  Section 4. Support Vector Regression Model
  Section 5. Hybrid Model
Chapter 5. Implementation and Results
  Section 1. Setting Up the GCP Cloud Virtual Machine Environment
  Section 2. Data Acquisition and Preprocessing
  Section 3. Building the Hybrid Model
  Section 4. Comparison of Model Results
Chapter 6. Conclusions and Future Research
References
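Illustrative note on the evaluation described in the abstract: the record states that the models are compared by mean squared error under a rolling window analysis, but it does not give the window sizes, the structure of the two layers, or the model settings. The sketch below is therefore only a minimal, single-layer illustration under stated assumptions. The function names (`rolling_window_mse`, `ar1_forecast`, `naive_forecast`) are hypothetical, the least-squares AR(1) fit is a simple stand-in for the thesis's ARIMA model, and no GDELT features or SVR stage are included.

```python
from statistics import fmean


def fit_ar1(history):
    """Least-squares fit of y[t] = a + b * y[t-1]; returns (a, b).

    Assumption: a plain AR(1) stands in for ARIMA in this sketch.
    """
    x, y = history[:-1], history[1:]
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # constant window: fall back to predicting the mean
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def ar1_forecast(history):
    """One-step-ahead forecast from an AR(1) fitted on `history`."""
    a, b = fit_ar1(history)
    return a + b * history[-1]


def naive_forecast(history):
    """Baseline: tomorrow's value equals today's value."""
    return history[-1]


def rolling_window_mse(series, window, forecast):
    """Slide a fixed-length training window over `series`, forecast one
    step ahead from each window, and return the mean squared error."""
    if window < 3 or window >= len(series):
        raise ValueError("window must be >= 3 and shorter than the series")
    squared_errors = []
    for end in range(window, len(series)):
        train = series[end - window:end]   # only past observations
        prediction = forecast(train)
        squared_errors.append((series[end] - prediction) ** 2)
    return fmean(squared_errors)


if __name__ == "__main__":
    # Toy data, not the U.S. Dollar Index: a perfectly linear series.
    toy = [float(t) for t in range(1, 21)]
    print("AR(1) MSE:", rolling_window_mse(toy, 5, ar1_forecast))
    print("naive MSE:", rolling_window_mse(toy, 5, naive_forecast))
```

Each forecast uses only observations that precede the target day, which is the property that makes a rolling window suitable for time series evaluation. Any other forecaster, such as a hybrid model with news features, could be compared in the same way by passing it as the `forecast` argument. On the toy linear series the AR(1) stand-in has essentially zero error, while the naive baseline has an MSE of 1.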