Title: Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example
Original title: 建構GDELT數位新聞分析流程於Spark大數據平台:以新聞事件影響力探究美國S&P股市指數變化為例
Author: Huang, Shu Wei (黃書瑋)
Contributors: Hu, Yuh Jong (胡毓忠), advisor; Huang, Shu Wei (黃書瑋), author
Keywords: GDELT project
Rolling-window machine learning
Big data analysis pipeline
News events influences
AWS
Date: 2017
Uploaded: 10-Aug-2017 10:18:59 (UTC+8)
Abstract: The GDELT project, publicly released in 2013, claims to monitor digital news media published in 65 languages worldwide. Using mature machine learning algorithms, natural language processing, deep learning, and other advanced artificial intelligence techniques, it extracts and transforms news data into structured records with 58 fields for further research and application in many domains. This study develops a big data analytics pipeline on the GDELT news event dataset. Using Spark ML Pipeline on the Amazon Web Services (AWS) cloud platform, it applies rolling-window machine learning to track the US S&P 500 stock index from GDELT data, and it performs a causal analysis of the influence of one specific event, Occupy Wall Street. The 45-day rolling-window random forest model adopted in this study achieved a root-mean-square error of only 43.35 (an error of 2.12%) when tracking and predicting the historical index; the error of the 15-minute near-real-time rolling prediction on the cloud system was below 1.5%. For the causal analysis, the study uses a Bayesian time series model to estimate the counterfactual index for the effect of Occupy Wall Street on the stock market, and it interprets the event and its subsequent effects as having raised the S&P 500 index by 116.76 points over the observation interval.
Description: Master's thesis
National Chengchi University (國立政治大學)
In-service Master's Program, Department of Computer Science
104971002
Source: http://thesis.lib.nccu.edu.tw/record/#G0104971002
Type: thesis
Identifier: G0104971002
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/111879
Format: application/pdf, 3692699 bytes
Table of contents:
Chapter 1 Introduction
1.1 Research motivation
1.2 Research objectives
1.3 Research results
Chapter 2 Research background
2.1 The GDELT project
2.1.1 CAMEO event coding
2.1.2 Event data processing systems
2.1.3 Dataset format
2.2 Data validation and visualization
2.2.1 Social network metric analysis
2.2.2 Visual analysis
Chapter 3 Related work
3.1 GDELT dataset case studies
3.2 Causality case studies
Chapter 4 Research methods and architecture
4.1 Data type conversion
4.2 Supervised machine learning and causal impact analysis
4.2.1 Time series analysis
4.2.2 Rolling-window random forest machine learning
4.2.3 Causal analysis
4.3 Big data pipeline processing flow
4.3.1 Python Scikit Learn machine learning package
4.3.2 Apache Spark ML cluster-based machine learning
4.4 Research architecture
Chapter 5 Implementation
5.1 Data preprocessing
5.2 Model selection and validation
5.3 Pipeline flow
5.4 Causal Impact analysis
5.5 Use of AWS cloud services
Chapter 6 Conclusions and future work
6.1 Conclusions and contributions
6.2 Limitations and suggestions
References
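The abstract describes a 45-day rolling-window model scored by root-mean-square error. The sketch below illustrates only that evaluation loop, under stated assumptions: the data are synthetic, the names `rolling_forecast`, `train_mean_model`, and `rmse` are hypothetical, and the trainer is a plain window-mean stand-in rather than the thesis's random forest or its Spark ML pipeline. Reporting the relative error as RMSE divided by the mean actual value is also an assumption; the record does not say how the 2.12% figure was computed.

```python
import math
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Model = Callable[[Row], float]
Trainer = Callable[[List[Row], List[float]], Model]


def train_mean_model(X: List[Row], y: List[float]) -> Model:
    """Stand-in trainer (NOT a random forest): predicts the window's mean target."""
    mean_y = sum(y) / len(y)
    return lambda features: mean_y


def rolling_forecast(
    X: List[Row], y: List[float], window: int, train: Trainer
) -> List[Tuple[int, float]]:
    """Refit on the previous `window` rows, then predict the next row."""
    predictions = []
    for t in range(window, len(y)):
        model = train(X[t - window:t], y[t - window:t])
        predictions.append((t, model(X[t])))
    return predictions


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic, illustrative data only: a linearly rising "index" over 100 days
# with the day number as the single feature.
window = 45
y = [2000.0 + 0.5 * t for t in range(100)]
X = [[float(t)] for t in range(100)]

preds = rolling_forecast(X, y, window, train_mean_model)
actual = [y[t] for t, _ in preds]
predicted = [p for _, p in preds]

error = rmse(actual, predicted)
relative = error / (sum(actual) / len(actual))  # assumed definition of relative error
print(f"RMSE={error:.2f}, relative error={relative:.2%}")
```

Any trainer with the same `(X, y) -> model` shape can be passed to `rolling_forecast`, so a random forest regressor could be swapped in for the stand-in without changing the loop.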