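Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example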

Title	建構GDELT數位新聞分析流程於Spark大數據平台：以新聞事件影響力探究美國S&P股市指數變化為例 Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example
Author	黃書瑋 Huang, Shu Wei
Contributors	胡毓忠 Hu, Yuh Jong; 黃書瑋 Huang, Shu Wei
Keywords	GDELT project; Rolling-Window machine learning; Big data analysis pipeline; News events influences; AWS
Date	2017
Upload time	10-Aug-2017 10:18:59 (UTC+8)
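Abstract	The GDELT project, publicly released in 2013, claims to monitor digital news media published in 65 languages worldwide. Using established machine learning algorithms, natural language processing, deep learning, and other artificial intelligence techniques, it extracts and transforms news data into structured records with 58 fields, making them available for further research and application across domains. This study builds a big data analysis pipeline on the GDELT news event dataset and uses Spark ML Pipeline on the Amazon Web Services (AWS) cloud platform to apply a rolling-window machine learning algorithm that tracks the US S&P 500 stock index from GDELT data, together with a causal analysis of the influence of one specific event, Occupy Wall Street. The 45-day rolling-window random forest model adopted in this study achieved a root-mean-square error of only 43.35 (an error of 2.12%) when tracking and predicting the historical index; the 15-minute near-real-time rolling prediction on the cloud system had an error below 1.5%. For the causal analysis, this study uses a Bayesian time series model to estimate the counterfactual index for the effect of the Occupy Wall Street event on the stock market, and concludes that the event and its aftermath raised the S&P 500 index by 116.76 points over the observation period.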
References	[1] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[2] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[3] Brodersen, Kay H., et al. "Inferring causal impact using Bayesian structural time-series models." The Annals of Applied Statistics 9.1 (2015): 247-274.
[4] Dietterich, Thomas G. "Ensemble methods in machine learning." International workshop on multiple classifier systems. Springer Berlin Heidelberg, 2000.
[5] Elwert, Felix. "Graphical causal models." Handbook of causal analysis for social research. Springer Netherlands, 2013. 245-273.
[6] Gerner, Deborah J., et al. "Conflict and mediation event observations (CAMEO): A new event data framework for the analysis of foreign policy interactions." International Studies Association, New Orleans (2002).
[7] Granger, Clive WJ. "Investigating causal relations by econometric models and cross-spectral methods." Econometrica: Journal of the Econometric Society (1969): 424-438.
[8] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. "Overview of supervised learning." The elements of statistical learning. Springer New York, 2009. 9-41.
[9] Jiang, Lei, and Fan Mai. "Discovering bilateral and multilateral causal events in GDELT." International conference on social computing, behavioral-cultural modeling, and prediction, Washington, DC. 2014.
[10] Kane, Michael J., et al. "Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks." BMC bioinformatics 15.1 (2014): 276.
[11] Keertipati, Swetha, et al. "Multi-Level Analysis of Peace and Conflict Data in GDELT." Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014.
[12] Kumar, Sumeet, Matthew Benigni, and Kathleen M. Carley. "The impact of US cyber policies on cyber-attacks trend." Intelligence and Security Informatics (ISI), 2016 IEEE Conference on. IEEE, 2016.
[13] Leetaru, Kalev, and Philip A. Schrodt. "GDELT: Global data on events, location, and tone, 1979-2012." ISA Annual Convention. Vol. 2. No. 4. 2013.
[14] Lindquist, Martin A., and Michael E. Sobel. "Graphical models, potential outcomes and causal inference: Comment on Ramsey, Spirtes and Glymour." NeuroImage 57.2 (2011): 334-336.
[15] Neyman, Jerzy. "Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes." Roczniki Nauk Rolniczych 10 (1923): 1-51.
[16] Norris, Clayton. "Petrarch 2: Petrarcher." arXiv preprint arXiv:1602.07236 (2016).
[17] Pai, Ping-Feng, and Chih-Sheng Lin. "A hybrid ARIMA and support vector machines model in stock price forecasting." Omega 33.6 (2005): 497-505.
[18] Pearl, Judea. "Graphical models, potential outcomes and causal inference: Comment on Lindquist and Sobel." NeuroImage 58.3 (2011): 770.
[19] Racette, Mark P., et al. "Improving situational awareness for humanitarian logistics through predictive modeling." Systems and Information Engineering Design Symposium (SIEDS), 2014. IEEE, 2014.
[20] Rubin, Donald B. "Causal inference using potential outcomes: Design, modeling, decisions." Journal of the American Statistical Association 100.469 (2005): 322-331.
[21] Schrodt, Philip A. "Automated coding of international event data using sparse parsing techniques." Annual meeting of the International Studies Association, Chicago. 2001.
[22] Schrodt, Philip A., and Blake Hall. "Twenty years of the Kansas event data system project." Political Methodologist 14.1 (2006): 2-6.
[23] Schrodt, Philip A., John Beieler, and Muhammed Idris. "Three's a Charm?: Open Event Data Coding with EL:DIABLO, PETRARCH, and the Open Event Data Alliance." ISA Annual Convention. 2014.
[24] Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association just-accepted (2017).
[25] Yonamine, James E. A nuanced study of political conflict using the Global Datasets of Events Location and Tone (GDELT) dataset. Diss. The Pennsylvania State University, 2013.
[26] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[27] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010): 95.
[28] Zivot, Eric, and Jiahui Wang. "Rolling Analysis of Time Series." Modeling Financial Time Series with S-Plus®. Springer New York, 2003. 299-346.
Description	Master's; National Chengchi University; In-service Master's Program, Department of Computer Science; 104971002
Source	http://thesis.lib.nccu.edu.tw/record/#G0104971002
Type	thesis

dc.contributor.advisor	胡毓忠	zh_TW
dc.contributor.advisor	Hu, Yuh Jong	en_US
dc.contributor.author (Author)	黃書瑋	zh_TW
dc.contributor.author (Author)	Huang, Shu Wei	en_US
dc.creator (Author)	黃書瑋	zh_TW
dc.creator (Author)	Huang, Shu Wei	en_US
dc.date (Date)	2017	en_US
dc.date.accessioned	10-Aug-2017 10:18:59 (UTC+8)	-
dc.date.available	10-Aug-2017 10:18:59 (UTC+8)	-
dc.date.issued (Upload time)	10-Aug-2017 10:18:59 (UTC+8)	-
dc.identifier (Other identifier)	G0104971002	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/111879	-
dc.description (Description)	Master's	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	In-service Master's Program, Department of Computer Science	zh_TW
dc.description (Description)	104971002	zh_TW
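dc.description.tableofcontents	Chapter 1 Introduction: 1.1 Research motivation; 1.2 Research objectives; 1.3 Research results. Chapter 2 Research background: 2.1 The GDELT project (2.1.1 CAMEO event coding; 2.1.2 Event data processing systems; 2.1.3 Dataset format); 2.2 Data validation and visualization (2.2.1 Social network metric analysis; 2.2.2 Visual analysis). Chapter 3 Related work: 3.1 Case studies on the GDELT dataset; 3.2 Case studies on causality. Chapter 4 Research method and architecture: 4.1 Data type conversion; 4.2 Supervised machine learning and causal impact analysis (4.2.1 Time series analysis; 4.2.2 Rolling-window random forest machine learning; 4.2.3 Causality analysis); 4.3 Big data pipeline processing flow (4.3.1 Python Scikit Learn machine learning package; 4.3.2 Apache Spark ML cluster-based machine learning); 4.4 Research architecture. Chapter 5 Implementation: 5.1 Data preprocessing; 5.2 Model selection and validation; 5.3 Pipeline flow; 5.4 Causal Impact analysis; 5.5 Use of AWS cloud services. Chapter 6 Conclusions and future work: 6.1 Conclusions and contributions; 6.2 Limitations and suggestions. References.	zh_TW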
dc.format.extent	3692699 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0104971002	en_US
dc.subject (Keywords)	GDELT project	en_US
dc.subject (Keywords)	Rolling-Window machine learning	en_US
dc.subject (Keywords)	Big data analysis pipeline	en_US
dc.subject (Keywords)	News events influences	en_US
dc.subject (Keywords)	AWS	en_US
dc.title (Title)	建構GDELT數位新聞分析流程於Spark大數據平台：以新聞事件影響力探究美國S&P股市指數變化為例	zh_TW
dc.title (Title)	Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example	en_US
dc.type (Type)	thesis	en_US