Title: 透過文字探勘預測台股報酬
Predicting Taiwan Stocks Returns with Text Data
Author: 郭亭佑 (Kuo, Ting-You)
Contributors: 翁久幸 (Weng, Chiu-Hsing) and 林士貴 (Lin, Shih-Kuei), advisors; 郭亭佑 (Kuo, Ting-You)
Keywords: Unstructured Data; Text Mining; Stock News; Machine Learning; Predict Stock Returns; Sentiment Analysis; Efficient-Market Hypothesis; Abnormal Returns
Date: 2021
Upload time: 4-Aug-2021 14:43:11 (UTC+8)
Abstract:
In recent years, unstructured data has grown rapidly, prompting many scholars to study the impact of news media on stock returns. News articles are the most common and accessible form of “public information” for investors when they trade. However, unlike financial reports or stock prices, news articles do not offer explicit numerical data that investors can analyze and use as a reference for investment. This research obtains sentiment information from Taiwan stock news through text mining and uses news sentiment scores to predict Taiwan stock returns. Following the text-mining methodology introduced by Ke, Kelly & Xiu (2019), we construct a Taiwan stock news sentiment model (Taiwan Stocks Sentiment Extraction via Screening and Topic Modeling, Taiwan SESTM). We find that this methodology is particularly suitable for analyzing the relationship between news articles and stock price movements. This study therefore extends the methodology to the Taiwan stock market and uses it to test the efficient-market hypothesis in Taiwan empirically. We find that a portfolio trading strategy built in the Taiwan stock market on the news sentiment scores estimated by Taiwan SESTM also yields large economic gains, and that the sentiment score has significant predictive and explanatory power for individual stock returns. Comparing the performance of the United States and Taiwan SESTM trading strategies, we find that Taiwan SESTM has higher predictive ability for stock returns before the news is released. We also find, by evaluating portfolio performance, that although Taiwan SESTM predicts stock returns effectively, the impact of news on the decisions of Taiwanese investors differs significantly from that in the United States. These results are in line with our economic intuition about the Taiwan stock market. We hope that the Taiwan SESTM constructed in this research can help establish a research base for financial text mining in Taiwan.

References:
1. 李昱穎. (2019). 新聞輿情分析在台灣股票市場之應用: 文字轉向量與動能策略. 政治大學金融學系學位論文, 1-40.
2. 陳信宏, 陳昱志, & 鄭舜仁. (2006). 以時間數列模型檢定台灣股票市場弱式效率性之研究. 管理科學與統計決策, 3(4), 8-17.
3. 鍾任明, 李維平, & 吳澤民. (2005). 運用文字探勘於日內股價漲跌趨勢預測之研究 (Doctoral dissertation, 撰者).
4. Azar, P. D., & Lo, A. W. (2016). The wisdom of Twitter crowds: Predicting stock market reactions to FOMC meetings via Twitter feeds. The Journal of Portfolio Management, 42(5), 123-134.
5. Alvarez-Ramirez, J., Rodriguez, E., & Espinosa-Paredes, G. (2012). Is the US stock market becoming weakly efficient over time? Evidence from 80-year-long data. Physica A: Statistical Mechanics and its Applications, 391(22), 5643-5647.
6. Bernard, V. L., & Thomas, J. K. (1990). Evidence that stock prices do not fully reflect the implications of current earnings for future earnings. Journal of Accounting and Economics, 13(4), 305-340.
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493-2537.
8. Cowles 3rd, A. (1933). Can stock market forecasters forecast? Econometrica: Journal of the Econometric Society, 309-324.
9. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
11. Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383-417.
12. Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849-911.
13. Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017, July). Convolutional sequence to sequence learning. In International Conference on Machine Learning (pp. 1243-1252). PMLR.
14. Heston, S. L., & Sinha, N. R. (2017). News vs. sentiment: Predicting stock returns from news stories. Financial Analysts Journal, 73(3), 67-83.
15. Hutchins, R. M. (1954). Great books. Western World.
16. Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1), 65-91.
17. Jegadeesh, N., & Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
18. Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
19. Ke, Z. T., Kelly, B. T., & Xiu, D. (2019). Predicting returns with text data (No. w26186). National Bureau of Economic Research.
20. Lakonishok, J., & Vermaelen, T. (1990). Anomalous price behavior around repurchase tender offers. The Journal of Finance, 45(2), 455-477.
21. Le, Q., & Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196). PMLR.
22. Loper, E., & Bird, S. (2002). NLTK: the natural language toolkit. arXiv preprint cs/0205028.
23. Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.
24. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119.
25. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
26. Ritter, J. R. (1991). The long-run performance of initial public offerings. The Journal of Finance, 46(1), 3-27.
27. Spiess, D. K., & Affleck-Graves, J. (1995). Underperformance in long-run stock returns following seasoned equity offerings. Journal of Financial Economics, 38(3), 243-267.
28. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
29. Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139-1168.
30. Tetlock, P. C. (2014). Information transmission in finance. Annual Review of Financial Economics, 6(1), 365-384.
31. Turing, I. B. A. (1950). Computing machinery and intelligence-AM Turing. Mind, 59(236), 433.
32. Wilson, D. S. (1975). A theory of group selection. Proceedings of the National Academy of Sciences, 72(1), 143-146.
33. Yang, B., Yih, W. T., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
34. Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
35. Zhang, Y., & Wallace, B. (2015). A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.

Description: Master's thesis
National Chengchi University
Department of Statistics
108354023
Source: http://thesis.lib.nccu.edu.tw/record/#G0108354023
Type: thesis
Other identifier: G0108354023
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/136324
Table of contents:
1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
2 Literature Review
2.1 Natural Language Processing
2.1.1 Text Mining and Quantification
2.1.2 Applications of Text Mining in Finance
2.2 Efficient-Market Hypothesis
3 Methodology
3.1 Model Setup
3.1.1 Data Structure
3.1.2 Distribution of Stock Returns
3.1.3 Distribution of News Text
3.2 Model Estimation
3.2.1 Screening Sentiment Words
3.2.2 Constructing the News Sentiment Score Model
3.2.3 Estimating Sentiment Scores for New Articles
3.3 Estimation Procedure for the Taiwan Stock News Sentiment Score Model
4 Empirical Analysis
4.1 Data Sources and Descriptive Statistics
4.2 Data Preprocessing
4.2.1 Natural Language Processing
4.2.2 Normalization
4.2.3 News Sentiment Score Examples
4.3 Empirical Results
4.3.1 Training and Predicting Stock Returns
4.3.2 Sentiment Words
4.3.3 Empirical Test of the Efficient-Market Hypothesis in Taiwan
4.3.4 Relationship Between News and Price Delay
4.3.5 Speed of Reaction to News
5 Conclusions and Suggestions
6 References
Format: application/pdf, 3157934 bytes
DOI: 10.6814/NCCU202101087
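
The abstract names the SESTM approach of Ke, Kelly & Xiu (2019), whose first stage screens for sentiment-charged words. The following is only a rough sketch of that screening idea and not the thesis's code: a word is kept when it appears in enough articles and the share of those articles with a positive return is far enough from one half. The function name `screen_sentiment_words`, the thresholds `alpha_plus`, `alpha_minus`, and `kappa`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def screen_sentiment_words(articles, returns, alpha_plus=0.2, alpha_minus=0.2, kappa=2):
    """Keep words whose co-occurrence with positive returns is far from 50%.

    articles: list of token lists (already segmented); hypothetical input format.
    returns:  list of realized returns, one per article.
    Returns a dict mapping each kept word to its positive-return frequency.
    """
    doc_count = Counter()  # number of articles containing the word
    pos_count = Counter()  # number of those articles with a positive return
    for tokens, ret in zip(articles, returns):
        for word in set(tokens):  # count each word once per article
            doc_count[word] += 1
            if ret > 0:
                pos_count[word] += 1

    selected = {}
    for word, k in doc_count.items():
        if k < kappa:  # drop words seen in too few articles
            continue
        f = pos_count[word] / k
        if f >= 0.5 + alpha_plus or f <= 0.5 - alpha_minus:
            selected[word] = f
    return selected


# Toy data, invented for illustration only.
articles = [
    ["revenue", "growth", "record"],
    ["revenue", "decline"],
    ["growth", "profit"],
    ["decline", "loss"],
    ["revenue", "growth"],
]
returns = [0.02, -0.01, 0.03, -0.02, 0.01]

print(sorted(screen_sentiment_words(articles, returns).items()))
# → [('decline', 0.0), ('growth', 1.0)]
```

In this toy run "revenue" appears in three articles, two of them with positive returns, so its frequency of about 0.67 falls inside the 0.3 to 0.7 band and it is screened out; words seen only once are dropped by the `kappa` filter.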