Title Constructing A Hybrid Model of ARIMA and SVR Algorithm with GDELT Digital News Dataset to Predict U.S. Dollar Index
建立ARIMA與SVR混合式模型,結合GDELT數位新聞資料集預測美元指數
Author Shen, Po-Yu (沈柏宇)
Contributors Liao, Szu-Lang (廖四郎)
Shen, Po-Yu (沈柏宇)
Keywords GDELT project
Hybrid model
U.S. dollar index
Date 2018
Uploaded 3-Jul-2018 17:27:00 (UTC+8)
Abstract News is an important source of information for fundamental analysis. This research asks how digital news data can supplement the price-prediction ability of traditional econometric models. It draws on the GDELT Project, a large, public digital news dataset whose varied news sources have been converted into structured data through text mining and natural language processing. After the data preprocessing proposed in this study, the GDELT records serve as features for the big data analysis component of several hybrid models, each built from an ARIMA model and that component; only some of the hybrid models take GDELT features. The models predict the price behavior of the U.S. Dollar Index and are compared by mean squared error.
     For the time series data, this study uses a two-layer rolling window analysis and evaluates the models on three test periods: before the European debt crisis (2009/06/02~2009/11/30, 130 daily observations), while the crisis was spreading (2009/12/01~2010/12/01, 260 daily observations), and after the crisis (2017/01/02~2017/06/30, 130 daily observations). In the first and last periods, hybrid models with GDELT features outperform the pure ARIMA model in predicting the U.S. Dollar Index; during the crisis period they do not. This study attributes the difference to a stronger link between market prices and financial news while a financial crisis is spreading: the GDELT dataset lacks finance-related news, so models that rely on it are limited in that setting and can even perform worse.
     Because of the large data volume, data processing and computation rely on parallel operation on a cluster architecture. The study therefore rents virtual machines on Google Cloud Platform and runs a Spark cluster on them (through PySpark) to carry out a near-real-time rolling window analysis.
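The rolling window evaluation mentioned in the abstract can be illustrated with a small standard-library sketch. This record does not spell out what the two layers are, so the code assumes one plausible reading: an inner pass that selects a model on data before the test period, and an outer pass that scores the selected model on the test period by mean squared error. The forecasters `last_value` and `window_mean` are hypothetical placeholders, not the thesis's ARIMA or SVR models, and all function names are illustrative.

```python
from statistics import mean


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def rolling_forecasts(series, window, forecaster):
    """One-step-ahead forecasts: use series[t-window:t] to predict series[t]."""
    return [forecaster(series[t - window:t]) for t in range(window, len(series))]


def last_value(history):
    """Placeholder forecaster (assumption): repeat the latest observation."""
    return history[-1]


def window_mean(history):
    """Placeholder forecaster (assumption): average of the window."""
    return mean(history)


def select_and_evaluate(series, test_start, window, candidates):
    """Assumed two-layer scheme: pick a model before test_start, score it after."""
    train = series[:test_start]
    # Inner layer: rolling MSE of every candidate on pre-test data only.
    inner = {
        name: mse(train[window:], rolling_forecasts(train, window, f))
        for name, f in candidates.items()
    }
    best = min(inner, key=inner.get)
    # Outer layer: roll the chosen model through the test period.
    preds = rolling_forecasts(series[test_start - window:], window, candidates[best])
    return best, mse(series[test_start:], preds)


if __name__ == "__main__":
    prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy data, not index values
    models = {"last_value": last_value, "window_mean": window_mean}
    print(select_and_evaluate(prices, test_start=5, window=2, candidates=models))
```

A real forecaster, such as an ARIMA fit with an SVR correction, would slot in as another entry in `candidates`, provided it maps a history window to a one-step-ahead prediction.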
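References [1] 黃書瑋 (2017). Constructing a GDELT digital news analysis pipeline on the Spark big data platform: exploring changes in the U.S. S&P stock index through news event impact [in Chinese]. Master's thesis, In-service Master's Program, Department of Computer Science, National Chengchi University.
     [2] Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70-83.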
     [3] Caporale, G. M., Spagnolo, F., & Spagnolo, N. (2017). Macro news and exchange rates in the BRICS. Finance Research Letters, 21, 140-143.
     [4] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. New York: Springer series in statistics.
     [5] Gidofalvi, G., & Elkan, C. (2001). Using news articles to predict stock price movements. Department of Computer Science and Engineering, University of California, San Diego.
     [6] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
     [7] Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: lightning-fast big data analysis. O'Reilly Media, Inc.
     [8] Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137-1145).
     [9] Lizardo, R. A., & Mollick, A. V. (2010). Oil price fluctuations and US dollar exchange rates. Energy Economics, 32(2), 399-408.
     [10] Loretan, M. (2005). Indexes of the foreign exchange value of the dollar. Fed. Res. Bull., 91, 1.
     [11] Mishra, S. (2017). Studying geo-conflict and cooperation over time using media reports: A case study using temporal geographical maps.
     [12] Mitchell, T. M. (1997). Machine learning. WCB.
     [13] Pai, P. F., & Lin, C. S. (2005). A hybrid ARIMA and support vector machines model in stock price forecasting. Omega, 33(6), 497-505.
     [14] Schrodt, P. (2012). Conflict and Mediation Event Observations event and actor codebook V. 1.1 b3.
     [15] Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and computing, 14(3), 199-222.
     [16] Tanenbaum, A. S., & Van Steen, M. (2017). Distributed systems (3rd ed.). Pearson Education, Inc.
     [17] Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of finance, 62(3), 1139-1168.
     [18] Wu, G. G. R., Hou, T. C. T., & Lin, J. L. (2018). Can economic news predict Taiwan stock market returns?. Asia Pacific Management Review.
     [19] Yoshioka, M., Allan, M. J. J., & Kando, N. (2018). Visualizing Polarity-based Stances of News Websites.
     [20] Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159-175.
     [21] Zhang, G., & Hu, M. Y. (1998). Neural network forecasting of the British pound/US dollar exchange rate. Omega, 26(4), 495-506.
     [22] Zhu, B., & Wei, Y. (2013). Carbon price forecasting with a novel hybrid ARIMA and least squares support vector machines methodology. Omega, 41(3), 517-524.
Description Master's thesis
National Chengchi University
Department of Money and Banking
105352034
Source http://thesis.lib.nccu.edu.tw/record/#G0105352034
Type thesis
dc.contributor.advisor Liao, Szu-Lang (廖四郎)
dc.identifier (Other Identifiers) G0105352034
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/118242
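dc.description.tableofcontents Acknowledgements
     Abstract (Chinese)
     Abstract
     Table of Contents
     List of Tables
     List of Figures
     Chapter 1 Introduction
     Section 1 Research Motivation
     Section 2 Research Objectives
     Section 3 Research Results
     Chapter 2 Introduction to the Datasets
     Section 1 The GDELT Project
     Section 2 The U.S. Dollar Index
     Chapter 3 Related Work
     Section 1 Predicting Financial Market Prices from Quantified News Events
     Section 2 Case Studies of Hybrid Models
     Chapter 4 Research Method and Architecture
     Section 1 Data Acquisition
     Section 2 Data Preprocessing
     Section 3 Rolling Window Analysis
     Section 4 Support Vector Regression Model
     Section 5 Hybrid Model
     Chapter 5 Implementation and Results
     Section 1 Setting Up the GCP Cloud Virtual Machine Environment
     Section 2 Data Acquisition and Preprocessing
     Section 3 Building the Hybrid Model
     Section 4 Comparison of Model Results
     Chapter 6 Conclusion and Future Research
     References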
dc.identifier.doi (DOI) 10.6814/THE.NCCU.MB.003.2018.F06