透過文字探勘預測台股報酬

Title	透過文字探勘預測台股報酬 (Predicting Taiwan Stocks Returns with Text Data)
Author	郭亭佑 (Kuo, Ting-You)
Contributors	Advisors: 翁久幸 (Weng, Chiu-Hsing); 林士貴 (Lin, Shih-Kuei). Author: 郭亭佑 (Kuo, Ting-You)
Keywords	Unstructured Data; Text Mining; Stock News; Machine Learning; Predict Stock Returns; Sentiment Analysis; Efficient-Market Hypothesis; Abnormal Returns
Date	2021
Upload time	4-Aug-2021 14:43:11 (UTC+8)
Abstract	Unstructured data has grown rapidly in recent years, prompting many scholars to study how news media affect stock returns. News is the public information that ordinary investors encounter most often when they trade. Unlike financial reports, however, news articles do not provide explicit numerical data that investors can analyze and use as a basis for investment decisions. This study uses text mining to extract sentiment information from Taiwan stock news and uses the resulting news sentiment scores to predict Taiwan stock returns. Following the text-mining method proposed by Ke, Kelly & Xiu (2019), we construct a sentiment score model for Taiwan stock news (Taiwan Stocks Sentiment Extraction via Screening and Topic Modeling, Taiwan SESTM). We find this method particularly well suited to analyzing the relationship between news articles and stock price movements, so we extend it to the Taiwan stock market and use it to test the efficient-market hypothesis empirically in Taiwan.
We find that portfolio trading strategies built on the news sentiment scores estimated by Taiwan SESTM also yield large economic gains in the Taiwan stock market, and that the sentiment score has significant predictive and explanatory power for individual stock returns. Comparing the performance of the U.S. and Taiwan SESTM trading strategies, we find that Taiwan SESTM has higher predictive power for stock returns before the news is released. By evaluating portfolio performance, we also find that although Taiwan SESTM predicts stock returns significantly, the influence of news on the decisions of Taiwanese investors differs significantly from that in the United States. These results are consistent with our economic intuition about the Taiwan stock market. We hope that the Taiwan SESTM constructed in this study can help lay a research foundation for financial text mining in Taiwan.
References	1. 李昱穎. (2019). 新聞輿情分析在台灣股票市場之應用: 文字轉向量與動能策略. 政治大學金融學系學位論文, 1-40.
2. 陳信宏, 陳昱志, & 鄭舜仁. (2006). 以時間數列模型檢定台灣股票市場弱式效率性之研究. 管理科學與統計決策, 3(4), 8-17.
3. 鍾任明, 李維平, & 吳澤民. (2005). 運用文字探勘於日內股價漲跌趨勢預測之研究 (Doctoral dissertation, 撰者).
4. Azar, P. D., & Lo, A. W. (2016). The wisdom of Twitter crowds: Predicting stock market reactions to FOMC meetings via Twitter feeds. The Journal of Portfolio Management, 42(5), 123-134.
5. Alvarez-Ramirez, J., Rodriguez, E., & Espinosa-Paredes, G. (2012). Is the US stock market becoming weakly efficient over time? Evidence from 80-year-long data. Physica A: Statistical Mechanics and its Applications, 391(22), 5643-5647.
6. Bernard, V. L., & Thomas, J. K. (1990). Evidence that stock prices do not fully reflect the implications of current earnings for future earnings. Journal of Accounting and Economics, 13(4), 305-340.
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493-2537.
8. Cowles 3rd, A. (1933). Can stock market forecasters forecast? Econometrica: Journal of the Econometric Society, 309-324.
9. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
11. Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383-417.
12. Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849-911.
13. Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017, July). Convolutional sequence to sequence learning. In International Conference on Machine Learning (pp. 1243-1252). PMLR.
14. Heston, S. L., & Sinha, N. R. (2017). News vs. sentiment: Predicting stock returns from news stories. Financial Analysts Journal, 73(3), 67-83.
15. Hutchins, R. M. (1954). Great books. Western World.
16. Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1), 65-91.
17. Jegadeesh, N., & Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
18. Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
19. Ke, Z. T., Kelly, B. T., & Xiu, D. (2019). Predicting returns with text data (No. w26186). National Bureau of Economic Research.
20. Lakonishok, J., & Vermaelen, T. (1990). Anomalous price behavior around repurchase tender offers. The Journal of Finance, 45(2), 455-477.
21. Le, Q., & Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196). PMLR.
22. Loper, E., & Bird, S. (2002). NLTK: The natural language toolkit. arXiv preprint cs/0205028.
23. Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35-65.
24. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119.
25. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
26. Ritter, J. R. (1991). The long-run performance of initial public offerings. The Journal of Finance, 46(1), 3-27.
27. Spiess, D. K., & Affleck-Graves, J. (1995). Underperformance in long-run stock returns following seasoned equity offerings. Journal of Financial Economics, 38(3), 243-267.
28. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
29. Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139-1168.
30. Tetlock, P. C. (2014). Information transmission in finance. Annual Review of Financial Economics, 6(1), 365-384.
31. Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433-460.
32. Wilson, D. S. (1975). A theory of group selection. Proceedings of the National Academy of Sciences, 72(1), 143-146.
33. Yang, B., Yih, W. T., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
34. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
35. Zhang, Y., & Wallace, B. (2015). A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.
Description	Master's thesis; National Chengchi University (國立政治大學); Department of Statistics; 108354023
Source	http://thesis.lib.nccu.edu.tw/record/#G0108354023
Type	thesis
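
The abstract and the table of contents (sections 3.2.1 to 3.2.3) describe three estimation steps: screening sentiment words, fitting the sentiment score model, and scoring new articles. The block below is a minimal sketch of those steps in the spirit of the SESTM procedure of Ke, Kelly & Xiu (2019). It is not the thesis's code. The function names, the thresholds `alpha_plus`, `alpha_minus`, and `kappa`, the penalty `lam`, and the grid search used in place of a numerical optimizer are all illustrative assumptions.

```python
import numpy as np


def screen_words(counts, returns, alpha_plus=0.1, alpha_minus=0.1, kappa=2):
    """Step 1: keep words that co-occur unusually often with positive
    or negative returns. `counts` is (n_articles, n_words)."""
    present = counts > 0
    k = present.sum(axis=0)                   # articles containing each word
    pos = present[returns > 0].sum(axis=0)    # ...that have a positive return
    f = np.divide(pos, k, out=np.full(k.shape, 0.5), where=k > 0)
    keep = ((f >= 0.5 + alpha_plus) | (f <= 0.5 - alpha_minus)) & (k >= kappa)
    return np.flatnonzero(keep)


def fit_topics(counts, returns):
    """Step 2: estimate positive and negative word distributions
    O = [O_plus, O_minus] by regressing word frequencies on return ranks.
    Assumes at least two articles with distinct returns."""
    n = len(returns)
    p_hat = (np.argsort(np.argsort(returns)) + 1) / n    # ranks scaled to (0, 1]
    W = np.vstack([p_hat, 1.0 - p_hat])                   # shape (2, n)
    totals = counts.sum(axis=1, keepdims=True)
    D = (counts / np.maximum(totals, 1)).T                # shape (n_words, n)
    O = D @ W.T @ np.linalg.inv(W @ W.T)                  # shape (n_words, 2)
    O = np.clip(O, 0.0, None)                             # drop negative entries
    O /= np.maximum(O.sum(axis=0, keepdims=True), 1e-12)  # columns sum to 1
    return O


def score_article(d, O, lam=1.0, grid=None):
    """Step 3: penalized maximum-likelihood sentiment score in (0, 1)
    for one article with sentiment-word counts `d`.
    The grid search is a simplification chosen for this sketch."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    s = d.sum()
    if s == 0:
        return 0.5                                        # no sentiment words: neutral
    mix = np.outer(grid, O[:, 0]) + np.outer(1.0 - grid, O[:, 1])
    loglik = (d * np.log(np.maximum(mix, 1e-12))).sum(axis=1) / s
    objective = loglik + lam * np.log(grid * (1.0 - grid))
    return float(grid[np.argmax(objective)])
```

In this sketch a score above 0.5 reads as positive sentiment and a score below 0.5 as negative. The penalty term `lam * log(p * (1 - p))` pulls scores toward the neutral value 0.5, which matters for short articles that contain few sentiment words.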

dc.contributor.advisor	翁久幸	zh_TW
dc.contributor.advisor	林士貴	zh_TW
dc.contributor.advisor	Weng, Chiu-Hsing	en_US
dc.contributor.advisor	Lin, Shih-Kuei	en_US
dc.contributor.author (Authors)	郭亭佑	zh_TW
dc.contributor.author (Authors)	Kuo, Ting-You	en_US
dc.creator (Author)	郭亭佑	zh_TW
dc.creator (Author)	Kuo, Ting-You	en_US
dc.date (Date)	2021	en_US
dc.date.accessioned	4-Aug-2021 14:43:11 (UTC+8)	-
dc.date.available	4-Aug-2021 14:43:11 (UTC+8)	-
dc.date.issued (Upload time)	4-Aug-2021 14:43:11 (UTC+8)	-
dc.identifier (Other Identifiers)	G0108354023	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/136324	-
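dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	108354023	zh_TW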
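dc.description.tableofcontents	Table of Contents
1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
2 Literature Review
2.1 Natural Language Processing
2.1.1 Text Mining and Quantification
2.1.2 Applications of Text Mining in Finance
2.2 Efficient-Market Hypothesis
3 Methodology
3.1 Model Setup
3.1.1 Data Structure
3.1.2 Distribution of Stock Returns
3.1.3 Distribution of News Text
3.2 Model Estimation
3.2.1 Screening Sentiment Words
3.2.2 Constructing the News Sentiment Score Model
3.2.3 Estimating Sentiment Scores for New Articles
3.3 Estimation Steps for the Taiwan Stock News Sentiment Score Model
4 Empirical Analysis
4.1 Data Sources and Descriptive Statistics
4.2 Data Preprocessing
4.2.1 Natural Language Processing
4.2.2 Normalization
4.2.3 Example of News Sentiment Scores
4.3 Empirical Results
4.3.1 Training and Predicting Stock Returns
4.3.2 Sentiment Words
4.3.3 Empirical Test of the Efficient-Market Hypothesis in Taiwan
4.3.4 Relationship Between News and Price Delay
4.3.5 Speed of Reaction to News
5 Conclusions and Suggestions
6 References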
dc.format.extent	3157934 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0108354023	en_US
dc.subject (Keywords)	非結構化數據	zh_TW
dc.subject (Keywords)	文字探勘	zh_TW
dc.subject (Keywords)	股票新聞	zh_TW
dc.subject (Keywords)	機器學習	zh_TW
dc.subject (Keywords)	預測股票報酬	zh_TW
dc.subject (Keywords)	情緒分析	zh_TW
dc.subject (Keywords)	效率市場假說	zh_TW
dc.subject (Keywords)	超額報酬	zh_TW
dc.subject (Keywords)	Unstructured Data	en_US
dc.subject (Keywords)	Text Mining	en_US
dc.subject (Keywords)	Stock News	en_US
dc.subject (Keywords)	Machine Learning	en_US
dc.subject (Keywords)	Predict Stock Returns	en_US
dc.subject (Keywords)	Sentiment Analysis	en_US
dc.subject (Keywords)	Efficient-Market Hypothesis	en_US
dc.subject (Keywords)	Abnormal Returns	en_US
dc.title (Title)	透過文字探勘預測台股報酬	zh_TW
dc.title (Title)	Predicting Taiwan Stocks Returns with Text Data	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202101087	en_US
