Predicting Stock Returns via Greedy Algorithm with Taiwanese News Data (在臺灣新聞資料下透過貪婪演算法預測股票報酬)


Title	Predicting Stock Returns via Greedy Algorithm with Taiwanese News Data (在臺灣新聞資料下透過貪婪演算法預測股票報酬)
Author	Cheng, Chang-Lei (程長磊)
Contributors	Lin, Shi-Kui (林士貴), advisor; Weng, Chiu-Hsing (翁久幸), advisor; Cheng, Chang-Lei (程長磊), author
Keywords	Text mining; Statistical learning; News sentiment analysis; Stock returns prediction; OGA; CGA
Date	2023
Upload time	1-Sep-2023 14:58:16 (UTC+8)
Abstract	Advances in big data and natural language processing have given unstructured data, and textual data in particular, considerable value for academic research. Many studies examine how textual information affects asset returns, which has made this an important research topic in finance. Textual data, however, are high-dimensional (the number of distinct words in news articles far exceeds the number of articles), so correctly analyzing the relationship between text and returns is a central issue in this line of research. News articles are the textual data that investors most commonly encounter when trading. Unlike financial statements, news articles do not provide quantitative information on which to base investment decisions. This study therefore uses two high-dimensional model selection methods, the Orthogonal Greedy Algorithm (OGA) proposed by Ing and Lai (2011) and the Chebyshev Greedy Algorithm (CGA), a refinement by Chen, Dai, Ing, and Lai (2019), to select frequently used words from news articles and quantify the sentiment scores of those articles. After excluding firm return factors, we estimate the relationship between the news sentiment factor and firm returns, and we compare the returns of portfolios constructed from news sentiment scores under linear and nonlinear assumptions on the dependent variable (returns). Under the linear assumption, with returns treated as a continuous variable, we use OGA and extend it to an OGA Predict model; under the nonlinear assumption, we use CGA and extend it to a CGA Predict model. Both are novel applications of these model selection methods to financial text analysis. We find that the CGA Predict model earns higher excess returns than the OGA Predict model. Performance evaluation further shows that the effect of news sentiment in the Taiwanese market, which is dominated by retail investors, differs significantly from its effect in the U.S. market, which is dominated by institutional investors. This result is consistent with our economic intuition about the Taiwanese stock market.
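The record does not describe the thesis's OGA Predict model in detail, so the following is only a minimal sketch of the generic OGA of Ing and Lai (2011) with a BIC-type HDIC stopping rule, not the author's implementation. The function names `oga` and `hdbic_select`, the synthetic matrix standing in for a word-count matrix, and the choice `max_steps=10` are all assumptions made for illustration; the Trim step listed in the table of contents is omitted.

```python
import numpy as np


def oga(X, y, max_steps):
    """Orthogonal greedy algorithm (forward selection with re-projection).

    Returns the selected column indices in order of entry and the residual
    mean squared error after each step.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # center predictors
    yc = y - y.mean()                  # center response
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = np.inf         # constant columns can never be picked
    available = np.ones(p, dtype=bool)
    selected, mse_path = [], []
    resid = yc.copy()
    for _ in range(min(max_steps, p, n - 1)):
        # Pick the unused column most correlated with the current residual.
        score = np.abs(Xc.T @ resid) / norms
        score[~available] = -np.inf
        j = int(np.argmax(score))
        selected.append(j)
        available[j] = False
        # Orthogonal step: refit least squares on all selected columns.
        coef, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        resid = yc - Xc[:, selected] @ coef
        mse_path.append(float(resid @ resid) / n)
    return selected, mse_path


def hdbic_select(selected, mse_path, n, p):
    """Keep the first k variables, with k minimizing
    n*log(mse_k) + k*log(n)*log(p) (the BIC-type HDIC)."""
    crit = [n * np.log(max(mse, 1e-300)) + (k + 1) * np.log(n) * np.log(p)
            for k, mse in enumerate(mse_path)]
    return selected[: int(np.argmin(crit)) + 1]


# Synthetic stand-in for a word-count matrix (hypothetical data, p > n).
rng = np.random.default_rng(0)
n, p = 300, 500
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 3] - 2.0 * X[:, 17] + 1.5 * X[:, 42] + 0.5 * rng.standard_normal(n)
path, mse_path = oga(X, y, max_steps=10)
chosen = hdbic_select(path, mse_path, n, p)
print(sorted(chosen))
```

In a text setting, the columns of `X` would correspond to word frequencies and `y` to returns; the indices kept by `hdbic_select` would then play the role of the selected sentiment words. How the thesis turns the selected words into sentiment scores and portfolios is not stated in this record.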
References
1. 郭亭佑 (2021). Predicting Taiwan stock returns through text mining (透過文字探勘預測台股報酬). Thesis, Department of Money and Banking, National Chengchi University.
2. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
3. Chen, Y. L., Dai, C. S., and Ing, C. K. (2019). High dimensional model selection via Chebyshev greedy algorithms. Working paper.
4. Fan, J., Xue, L., and Zhou, Y. (2021). How much can machines learn finance from Chinese text data? Working paper.
5. Gentzkow, M., Kelly, B., and Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3), 535-574.
6. Henry, E. (2008). Are investors influenced by how earnings press releases are written? The Journal of Business Communication, 45(4), 363-407.
7. Ing, C. K., and Lai, T. L. (2011). A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 1473-1513.
8. Jegadeesh, N., and Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
9. Ke, Z. T., Kelly, B. T., and Xiu, D. (2019). Predicting returns with text data. Working paper.
10. Loughran, T., and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.
11. Manela, A., and Moreira, A. (2017). News implied volatility and disaster concerns. Journal of Financial Economics, 123(1), 137-162.
12. Temlyakov, V. N. (2015). Greedy approximation in convex optimization. Constructive Approximation, 41(2), 269-296.
13. Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139-1168.
14. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
15. You, J., Zhang, B., and Zhang, L. (2018). Who captures the power of the pen? Review of Financial Studies, 31(1), 43-96.
Description	Master's thesis, National Chengchi University, Department of Statistics, 110354030
Source	http://thesis.lib.nccu.edu.tw/record/#G0110354030
Type	thesis

dc.contributor.advisor	林士貴; 翁久幸	zh_TW
dc.contributor.advisor	Lin, Shi-Kui; Weng, Chiu-Hsing	en_US
dc.contributor.author (Authors)	程長磊	zh_TW
dc.contributor.author (Authors)	Cheng, Chang-Lei	en_US
dc.creator (Author)	程長磊	zh_TW
dc.creator (Author)	Cheng, Chang-Lei	en_US
dc.date (Date)	2023	en_US
dc.date.accessioned	1-Sep-2023 14:58:16 (UTC+8)	-
dc.date.available	1-Sep-2023 14:58:16 (UTC+8)	-
dc.date.issued (Upload time)	1-Sep-2023 14:58:16 (UTC+8)	-
dc.identifier (Other Identifiers)	G0110354030	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/146908	-
dc.description (Description)	碩士	zh_TW
dc.description (Description)	國立政治大學	zh_TW
dc.description (Description)	統計學系	zh_TW
dc.description (Description)	110354030	zh_TW
dc.description.tableofcontents
	Abstract (Chinese)
	Abstract
	Table of Contents
	List of Figures
	List of Tables
	1 Introduction
	1.1 Research Background
	1.2 Research Motivation and Objectives
	2 Literature Review
	2.1 Financial Text Analysis
	2.1.1 Traditional Dictionary Methods
	2.1.2 Machine Learning Methods
	2.1.3 Statistical and Econometric Models
	2.2 High-Dimensional Model Selection Methods
	3 Methodology
	3.1 Data Structure
	3.2 Pure Greedy Algorithm
	3.2.1 Regression Model Setup
	3.2.2 PGA process
	3.3 Orthogonal Greedy Algorithm
	3.3.1 OGA process
	3.3.2 High Dimensional Information Criterion (HDIC)
	3.3.3 Trim
	3.3.4 Introduction to OGA Predict
	3.4 Chebyshev Greedy Algorithm
	3.4.1 Regression Model Setup
	3.4.2 CGA process
	3.4.3 High Dimensional Information Criterion (HDIC)
	3.4.4 Trim
	3.4.5 Introduction to CGA Predict
	4 Empirical Analysis
	4.1 Data Sources and Descriptive Statistics
	4.2 Data Preprocessing and Statistical Analysis Workflow
	4.2.1 Natural Language Processing
	4.2.2 Normalization
	4.2.3 Data Preprocessing and Statistical Analysis Workflow
	4.3 Empirical Results
	4.3.1 Predicting Stock Returns
	4.3.2 Word Selection Results
	4.3.3 Comparison of Portfolio Returns
	4.3.4 Relationship Between News and Price Delay, and Reaction Speed
	4.3.5 Heterogeneity Analysis
	5 Conclusions and Suggestions
	5.1 Conclusions and Suggestions
	6 References
dc.format.extent	3382854 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0110354030	en_US
dc.subject (Keywords)	文字探勘	zh_TW
dc.subject (Keywords)	統計學習	zh_TW
dc.subject (Keywords)	新聞情緒分析	zh_TW
dc.subject (Keywords)	預測股票報酬	zh_TW
dc.subject (Keywords)	OGA	zh_TW
dc.subject (Keywords)	CGA	zh_TW
dc.subject (Keywords)	Text mining	en_US
dc.subject (Keywords)	Statistical Learning	en_US
dc.subject (Keywords)	News Sentiment Analysis	en_US
dc.subject (Keywords)	Stock Returns Prediction	en_US
dc.subject (Keywords)	OGA	en_US
dc.subject (Keywords)	CGA	en_US
dc.title (Title)	在臺灣新聞資料下透過貪婪演算法預測股票報酬	zh_TW
dc.title (Title)	Predicting Stock Returns via Greedy Algorithm with Taiwanese News Data	en_US
dc.type (Type)	thesis	en_US
