Title: 在臺灣新聞資料下透過貪婪演算法預測股票報酬
Predicting Stock Returns via Greedy Algorithm with Taiwanese News Data
Author: 程長磊 (Cheng, Chang-Lei)
Advisors: 林士貴 (Lin, Shi-Kui); 翁久幸 (Weng, Chiu-Hsing)
Keywords: Text mining; Statistical learning; News sentiment analysis; Stock returns prediction; OGA; CGA
Date: 2023
Upload time: 1-Sep-2023 14:58:16 (UTC+8)
Abstract: Advances in big data and natural language processing have given unstructured data, textual data in particular, considerable research value. A growing body of work studies how textual information affects asset returns, making this one of the important research topics in finance. Textual data are high-dimensional, however (the number of distinct words in news articles far exceeds the number of articles), so analyzing the relationship between text and returns correctly is a central issue in this line of research. News articles are the textual data that investors most commonly encounter when trading. Unlike financial statements, news articles provide no quantitative figures on which to base an investment decision. This study therefore applies two high-dimensional model selection methods, the Orthogonal Greedy Algorithm (OGA) proposed by Ing and Lai (2011) and the Chebyshev Greedy Algorithm (CGA), a refinement by Chen, Dai, Ing, Lai (2019), to select frequently used words from news articles and thereby quantify the sentiment score of each article. After excluding firm-level return factors, we estimate the relationship between the news sentiment factor and firm returns, and we compare the returns of portfolios constructed from news sentiment scores under a linear and a nonlinear assumption on the dependent variable (returns). Under the linear assumption, with returns treated as a continuous variable, we use OGA and extend it into the OGA Predict model; under the nonlinear assumption, we use CGA and extend it into the CGA Predict model. Both selection methods are applied here to financial text analysis as a novel use. We find that the CGA Predict model earns higher excess returns than the OGA Predict model. Performance evaluation further shows that the effect of news sentiment in the Taiwanese market, which is dominated by retail investors, differs significantly from its effect in the US market, which is dominated by institutional investors. This result is consistent with our economic intuition about the Taiwanese stock market.
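The selection step named in the abstract can be illustrated with a minimal sketch. This is not the thesis's OGA Predict model; it is a generic forward greedy selection followed by an information-criterion stopping rule and a trimming pass, in the spirit of the OGA + HDIC + Trim procedure of Ing and Lai (2011). The penalty form `log(n) * log(p)` per selected variable, the helper names, and the toy data are all assumptions made for illustration. In the thesis's setting each column would correspond to a word-frequency variable and `y` to returns.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ols_residual(columns, y, subset):
    """Residual of y after least-squares projection onto columns[subset].

    Uses modified Gram-Schmidt; assumes the chosen columns are linearly
    independent (this sketch has no guard against exact collinearity).
    """
    residual, basis = list(y), []
    for j in subset:
        q = list(columns[j])
        for b in basis:
            c = dot(q, b) / dot(b, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        c = dot(residual, q) / dot(q, q)
        residual = [r - c * qi for r, qi in zip(residual, q)]
        basis.append(q)
    return residual


def oga_path(columns, y, max_steps):
    """Greedy path: repeatedly add the column most correlated with the residual."""
    selected, residual = [], list(y)
    for _ in range(min(max_steps, len(columns))):
        candidates = [j for j in range(len(columns)) if j not in selected]
        best = max(
            candidates,
            key=lambda j: abs(dot(columns[j], residual))
            / math.sqrt(dot(columns[j], columns[j])),
        )
        selected.append(best)
        # Refit on all selected columns (the "orthogonal" step).
        residual = ols_residual(columns, y, selected)
    return selected


def hdic(columns, y, subset):
    """Assumed criterion: n*log(residual variance) + |subset|*log(n)*log(p)."""
    n, p = len(y), len(columns)
    resid = ols_residual(columns, y, subset)
    return n * math.log(dot(resid, resid) / n) + len(subset) * math.log(n) * math.log(p)


def oga_hdic_trim(columns, y, max_steps):
    path = oga_path(columns, y, max_steps)
    # Stop the path at the length with the smallest criterion value.
    k_hat = min(range(1, len(path) + 1), key=lambda k: hdic(columns, y, path[:k]))
    kept = path[:k_hat]
    if k_hat == 1:
        return kept
    full = hdic(columns, y, kept)
    # Trim: keep a variable only if dropping it makes the criterion worse.
    return [j for j in kept
            if hdic(columns, y, [i for i in kept if i != j]) > full]


def simulate(seed, n=200, p=50):
    """Hypothetical toy data: y depends only on columns 3 and 17."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = [3.0 * columns[3][i] - 2.0 * columns[17][i] + rng.gauss(0.0, 0.5)
         for i in range(n)]
    return columns, y


columns, y = simulate(seed=0)
print(sorted(oga_hdic_trim(columns, y, max_steps=8)))
```

The sketch refits the full least-squares projection at every step rather than updating it incrementally, which keeps the code short at the cost of speed; with thousands of word columns an incremental orthogonalization would be the more practical choice.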
References:
1. 郭亭佑 (2021). 透過文字探勘預測台股報酬 [Predicting Taiwan stock returns through text mining]. Thesis, Department of Money and Banking, National Chengchi University.
2. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
3. Chen, Y. L., Dai, C. S., and Ing, C. K. (2019). High dimensional model selection via Chebyshev greedy algorithms. Working Paper.
4. Fan, J., Xue, L., and Zhou, Y. (2021). How much can machines learn finance from Chinese text data? Working Paper.
5. Gentzkow, M., Kelly, B., and Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3), 535-574.
6. Henry, E. (2008). Are investors influenced by how earnings press releases are written? The Journal of Business Communication, 45(4), 363-407.
7. Ing, C. K., and Lai, T. L. (2011). A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 1473-1513.
8. Jegadeesh, N., and Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
9. Ke, Z. T., Kelly, B. T., and Xiu, D. (2019). Predicting returns with text data. Working Paper.
10. Loughran, T., and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1), 35-65.
11. Manela, A., and Moreira, A. (2017). News implied volatility and disaster concerns. Journal of Financial Economics, 123(1), 137-162.
12. Temlyakov, V. N. (2015). Greedy approximation in convex optimization. Constructive Approximation, 41(2), 269-296.
13. Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139-1168.
14. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
15. You, J., Zhang, B., and Zhang, L. (2018). Who captures the power of the pen? Review of Financial Studies, 31(1), 43-96.
Description: Master's degree; National Chengchi University (國立政治大學); Department of Statistics; 110354030
Source: http://thesis.lib.nccu.edu.tw/record/#G0110354030
Type: thesis
Identifier: G0110354030
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/146908
Table of contents:
Abstract (Chinese)
Abstract
Contents
List of figures
List of tables
1 Introduction
1.1 Research background
1.2 Research motivation and objectives
2 Literature review
2.1 Financial text analysis
2.1.1 Traditional dictionary methods
2.1.2 Machine learning methods
2.1.3 Statistical and econometric models
2.2 High-dimensional model selection methods
3 Methodology
3.1 Data structure
3.2 Pure Greedy Algorithm
3.2.1 Regression model specification
3.2.2 PGA process
3.3 Orthogonal Greedy Algorithm
3.3.1 OGA process
3.3.2 High Dimensional Information Criterion (HDIC)
3.3.3 Trim
3.3.4 Introduction to OGA Predict
3.4 Chebyshev Greedy Algorithm
3.4.1 Regression model specification
3.4.2 CGA process
3.4.3 High Dimensional Information Criterion (HDIC)
3.4.4 Trim
3.4.5 Introduction to CGA Predict
4 Empirical analysis
4.1 Data sources and descriptive statistics
4.2 Data preprocessing and statistical analysis workflow
4.2.1 Natural language processing
4.2.2 Normalization (正規化)
4.2.3 Data preprocessing and statistical analysis workflow
4.3 Empirical results
4.3.1 Predicting stock returns
4.3.2 Word selection results
4.3.3 Comparison of portfolio returns
4.3.4 Relationship between news and price delay, and speed of reaction
4.3.5 Heterogeneity analysis
5 Conclusions and suggestions
5.1 Conclusions and suggestions
6 References
Format: application/pdf, 3382854 bytes