Title 建構GDELT數位新聞分析流程於Spark大數據平台:以新聞事件影響力探究美國S&P股市指數變化為例
Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example
Author 黃書瑋
Huang, Shu Wei
Contributors 胡毓忠
Hu, Yuh Jong
黃書瑋
Huang, Shu Wei
Keywords GDELT project
Rolling-Window machine learning
Big data analysis pipeline
News events influences
Amazon Web Services (AWS)
Date 2017
Upload time 10-Aug-2017 10:18:59 (UTC+8)
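Abstract The GDELT Project, publicly released in 2013, claims to monitor digital news media published in 65 languages worldwide. Using mature machine learning algorithms, natural language processing, deep learning, and other advanced artificial intelligence techniques, it extracts and transforms valuable news data into structured data with 58 fields, available for further research and application in many domains. This study develops a big data analysis pipeline on the GDELT news event dataset and uses Spark ML Pipeline technology on the Amazon Web Services (AWS) cloud platform to apply a rolling-window machine learning algorithm that tracks the U.S. Standard & Poor's 500 (S&P 500) stock index primarily from GDELT data, and to carry out a causal analysis of the influence of one specific event, Occupy Wall Street. The 45-day rolling-window random forest model adopted in this study achieved a root-mean-square error of only 43.35 (an error of 2.12%) in tracking and predicting the historical index; the error of the 15-minute near-real-time rolling-window prediction on the cloud system was below 1.5%. For the causal analysis, this study uses a Bayesian time series model to estimate the counterfactual index for the effect of the Occupy Wall Street event on the stock market, and interprets the event and its aftermath as having raised the S&P 500 index by 116.76 points over the observation period.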
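As a rough illustration of the rolling-window evaluation the abstract describes, the sketch below refits a model on the most recent `window` observations, predicts the next value, and reports the root-mean-square error. It is a minimal sketch and not the thesis's implementation: the window-mean predictor is a stand-in for the random forest, the function names and sample values are hypothetical, and expressing the RMSE as a percentage of the mean actual value is an assumption, since the abstract does not say how its 2.12% figure is normalized.

```python
from math import sqrt
from statistics import mean
from typing import Callable, List, Sequence


def rolling_forecast(
    series: Sequence[float],
    window: int,
    fit_predict: Callable[[Sequence[float]], float],
) -> List[float]:
    """Predict series[t] from the `window` observations before t, for every t >= window."""
    return [fit_predict(series[t - window:t]) for t in range(window, len(series))]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def relative_rmse_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE as a percentage of the mean actual value (an assumed normalization)."""
    return 100.0 * rmse(actual, predicted) / mean(actual)


def window_mean(history: Sequence[float]) -> float:
    """Stand-in model: predicts the mean of the window (NOT a random forest)."""
    return mean(history)


if __name__ == "__main__":
    # Made-up index values for illustration only; not data from the thesis.
    index = [2000.0, 2010.0, 2005.0, 2020.0, 2030.0, 2025.0, 2040.0]
    window = 3  # the thesis uses a 45-day window; 3 keeps the toy example short
    predictions = rolling_forecast(index, window, window_mean)
    actual = index[window:]
    print("RMSE:", rmse(actual, predictions))
    print("RMSE as % of mean:", relative_rmse_percent(actual, predictions))
```

Passing the model in as a callable keeps the rolling loop independent of the learner, so a real regressor could replace `window_mean` without changing the evaluation code.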
References [1] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[2] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[3] Brodersen, Kay H., et al. "Inferring causal impact using Bayesian structural time-series models." The Annals of Applied Statistics 9.1 (2015): 247-274.
[4] Dietterich, Thomas G. "Ensemble methods in machine learning." International workshop on multiple classifier systems. Springer Berlin Heidelberg, 2000.
[5] Elwert, Felix. "Graphical causal models." Handbook of causal analysis for social research. Springer Netherlands, 2013. 245-273.
[6] Gerner, Deborah J., et al. "Conflict and mediation event observations (CAMEO): A new event data framework for the analysis of foreign policy interactions." International Studies Association, New Orleans (2002).
[7] Granger, Clive WJ. "Investigating causal relations by econometric models and cross-spectral methods." Econometrica: Journal of the Econometric Society (1969): 424-438.
[8] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. "Overview of supervised learning." The elements of statistical learning. Springer New York, 2009. 9-41.
[9] Jiang, Lei, and Fan Mai. "Discovering bilateral and multilateral causal events in GDELT." International conference on social computing, behavioral-cultural modeling, and prediction, Washington, DC. 2014.
[10] Kane, Michael J., et al. "Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks." BMC bioinformatics 15.1 (2014): 276.
[11] Keertipati, Swetha, et al. "Multi-Level Analysis of Peace and Conflict Data in GDELT." Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014.
[12] Kumar, Sumeet, Matthew Benigni, and Kathleen M. Carley. "The impact of US cyber policies on cyber-attacks trend." Intelligence and Security Informatics (ISI), 2016 IEEE Conference on. IEEE, 2016.
[13] Leetaru, Kalev, and Philip A. Schrodt. "GDELT: Global data on events, location, and tone, 1979-2012." ISA Annual Convention. Vol. 2. No. 4. 2013.
[14] Lindquist, Martin A., and Michael E. Sobel. "Graphical models, potential outcomes and causal inference: Comment on Ramsey, Spirtes and Glymour." NeuroImage 57.2 (2011): 334-336.
[15] Neyman, Jerzy. "Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes." Roczniki Nauk Rolniczych 10 (1923): 1-51.
[16] Norris, Clayton. "Petrarch 2: Petrarcher." arXiv preprint arXiv:1602.07236 (2016).
[17] Pai, Ping-Feng, and Chih-Sheng Lin. "A hybrid ARIMA and support vector machines model in stock price forecasting." Omega 33.6 (2005): 497-505.
[18] Pearl, Judea. "Graphical models, potential outcomes and causal inference: Comment on Lindquist and Sobel." NeuroImage 58.3 (2011): 770.
[19] Racette, Mark P., et al. "Improving situational awareness for humanitarian logistics through predictive modeling." Systems and Information Engineering Design Symposium (SIEDS), 2014. IEEE, 2014.
[20] Rubin, Donald B. "Causal inference using potential outcomes: Design, modeling, decisions." Journal of the American Statistical Association 100.469 (2005): 322-331.
[21] Schrodt, Philip A. "Automated coding of international event data using sparse parsing techniques." Annual meeting of the International Studies Association, Chicago. 2001.
[22] Schrodt, Philip A., and Blake Hall. "Twenty years of the Kansas event data system project." Political Methodologist 14.1 (2006): 2-6.
[23] Schrodt, Philip A., John Beieler, and Muhammed Idris. "Three's a Charm?: Open Event Data Coding with EL:DIABLO, PETRARCH, and the Open Event Data Alliance." ISA Annual Convention. 2014.
[24] Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association just-accepted (2017).
[25] Yonamine, James E. A nuanced study of political conflict using the Global Datasets of Events Location and Tone (GDELT) dataset. Diss. The Pennsylvania State University, 2013.
[26] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[27] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010): 95.
[28] Zivot, Eric, and Jiahui Wang. "Rolling Analysis of Time Series." Modeling Financial Time Series with S-Plus®. Springer New York, 2003. 299-346.
Description Master's
National Chengchi University
Department of Computer Science, In-service Master's Program
104971002
Source http://thesis.lib.nccu.edu.tw/record/#G0104971002
Type thesis
dc.contributor.advisor 胡毓忠 zh_TW
dc.contributor.advisor Hu, Yuh Jong en_US
dc.contributor.author (Authors) 黃書瑋 zh_TW
dc.contributor.author (Authors) Huang, Shu Wei en_US
dc.creator (Author) 黃書瑋 zh_TW
dc.creator (Author) Huang, Shu Wei en_US
dc.date (Date) 2017 en_US
dc.date.accessioned 10-Aug-2017 10:18:59 (UTC+8) -
dc.date.available 10-Aug-2017 10:18:59 (UTC+8) -
dc.date.issued (Upload time) 10-Aug-2017 10:18:59 (UTC+8) -
dc.identifier (Other Identifiers) G0104971002 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/111879 -
dc.description (Description) Master's zh_TW
dc.description (Description) National Chengchi University zh_TW
dc.description (Description) Department of Computer Science, In-service Master's Program zh_TW
dc.description (Description) 104971002 zh_TW
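dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Research Results
Chapter 2 Research Background
2.1 The GDELT Project
2.1.1 CAMEO Event Coding
2.1.2 Event Data Processing Systems
2.1.3 Dataset Format
2.2 Data Validation and Visualization
2.2.1 Social Network Metric Analysis
2.2.2 Visual Analysis
Chapter 3 Related Work
3.1 Case Studies Using the GDELT Dataset
3.2 Case Studies on Causality
Chapter 4 Research Method and Architecture
4.1 Data Type Conversion
4.2 Supervised Machine Learning and Causal Impact Analysis
4.2.1 Time Series Analysis
4.2.2 Rolling-Window Random Forest Machine Learning
4.2.3 Causal Analysis
4.3 Big Data Pipeline Processing Flow
4.3.1 Python Scikit Learn Machine Learning Package
4.3.2 Apache Spark ML Cluster-Based Machine Learning
4.4 Research Architecture
Chapter 5 Implementation
5.1 Data Preprocessing
5.2 Model Selection and Validation
5.3 Pipeline Flow
5.4 Causal Impact Analysis
5.5 Use of AWS Cloud Services
Chapter 6 Conclusions and Future Work
6.1 Conclusions and Contributions
6.2 Limitations and Suggestions
References
zh_TW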
dc.format.extent 3692699 bytes -
dc.format.mimetype application/pdf -
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0104971002 en_US
dc.subject (Keywords) GDELT project zh_TW
dc.subject (Keywords) Rolling-window machine learning zh_TW
dc.subject (Keywords) Big data analysis pipeline zh_TW
dc.subject (Keywords) News influence zh_TW
dc.subject (Keywords) Amazon Web Services zh_TW
dc.subject (Keywords) GDELT project en_US
dc.subject (Keywords) Rolling-Window machine learning en_US
dc.subject (Keywords) Big data analysis pipeline en_US
dc.subject (Keywords) News events influences en_US
dc.subject (Keywords) AWS en_US
dc.title (Title) 建構GDELT數位新聞分析流程於Spark大數據平台:以新聞事件影響力探究美國S&P股市指數變化為例 zh_TW
dc.title (Title) Establishing GDELT digital news analytics pipeline on the Spark platform: exploiting news events influences on S&P stock index variations as an example en_US
dc.type (Type) thesis en_US