Title 財報文字分析之句子風險程度偵測研究
Risk-related Sentence Detection in Financial Reports
Author 柳育彣
Liu, Yu-Wen
Contributors 蔡銘峰; 王釧茹
Tsai, Ming-Feng; Wang, Chuan-Ju
柳育彣
Liu, Yu-Wen
Keywords Text mining
Financial risk
Sentiment analysis
Machine learning
Date 2017
Upload time 1-Nov-2017 14:23:00 (UTC+8)
Abstract This thesis applies text sentiment analysis to assess financial risk at the sentence level in the financial reports of U.S. listed companies. Most previous studies of financial report text focused on word-level risk detection. However, most financial keywords are highly context-sensitive, so reading individual words in isolation may miss the financial information they imply and yield biased results. This thesis therefore raises the unit of analysis from the word to the sentence. We use two embedding-based models, fastText and Siamese CBOW, to learn sentence representations. Because embedding-based models represent the meaning of a target word through its association with the surrounding words, they can capture deeper financial meaning in report sentences and yield sentence vectors better suited to financial text analysis. In our experiments, we train financial risk classifiers on the 10-K corpus and on a financial sentiment dataset proposed in this thesis and labeled by financial professionals. We adopt the Bag-of-Words model as a baseline and evaluate with accuracy, precision, recall, and F1-score. The results show that the embedding-based representations predict financial risk more accurately than the Bag-of-Words model. Because the volume of online information has grown sharply in the big-data era, it has become increasingly difficult for a small number of people to analyze massive amounts of financial information in a short time, and helping professionals make efficient financial judgments and decisions has become an important issue. This thesis therefore also presents RiskFinder, a web-based system for detecting risk-related sentences in financial reports, built on the fastText and Siamese CBOW models. The system contains 40,708 financial reports of U.S. listed companies from 1996 to 2013 and highlights the risk-related sentences predicted by each sentence embedding model. It also displays metadata and a visualization of stock trading time series for the corresponding company, aligned to the release date of each financial report, so that users can compare textual and numerical financial data along the stock price trend. The system facilitates case studies in finance and can help professionals extract meaningful information from large amounts of financial text.
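The evaluation metrics named in the abstract (accuracy, precision, recall, and F1-score) can be illustrated with a minimal sketch. The function name `binary_metrics`, the label convention (1 = risk-related sentence, 0 = other), and the toy labels below are assumptions made for illustration only; they are not taken from the thesis and do not reproduce its data or results.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}.

    Assumption for illustration: label 1 marks a risk-related sentence.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): gold annotations vs. classifier predictions.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(gold, pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when risk-related sentences are a minority class, since a classifier that rarely predicts "risk" can still score high accuracy.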
Description Master's
National Chengchi University
Department of Computer Science
104753035
Source http://thesis.lib.nccu.edu.tw/record/#G1047530352
Type thesis
dc.contributor.advisor 蔡銘峰; 王釧茹 zh_TW
dc.contributor.advisor Tsai, Ming-Feng; Wang, Chuan-Ju en_US
dc.contributor.author (Authors) 柳育彣 zh_TW
dc.contributor.author (Authors) Liu, Yu-Wen en_US
dc.creator (Author) 柳育彣 zh_TW
dc.creator (Author) Liu, Yu-Wen en_US
dc.date (Date) 2017 en_US
dc.date.accessioned 1-Nov-2017 14:23:00 (UTC+8)
dc.date.available 1-Nov-2017 14:23:00 (UTC+8)
dc.date.issued (Upload time) 1-Nov-2017 14:23:00 (UTC+8)
dc.identifier (Other Identifiers) G1047530352 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/114289
dc.description (Description) Master's
dc.description (Description) National Chengchi University
dc.description (Description) Department of Computer Science
dc.description (Description) 104753035
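dc.description.tableofcontents Acknowledgements
Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Research Background
1.2 Traditional Financial Risk Prediction Methods and Their Limitations
1.3 Research Objectives
Chapter 2 Related Work
2.1 Financial Risk Prediction
2.2 Text Sentiment Analysis and Word Vector Representations
Chapter 3 Methodology
3.1 Word2Vec
3.2 fastText
3.2.1 Subword-based Word Vector Learning
3.2.2 Linear Classifier on Sentence Vectors
3.3 Siamese CBOW
3.4 Labeled Dataset of Risk-related Sentences in Financial Reports
Chapter 4 Experimental Results and Discussion
4.1 Experimental Settings
4.1.1 Dataset Collection and Preprocessing
4.1.2 Quantitative Evaluation Metrics
4.2 Analysis and Discussion of Experimental Results
4.3 Summary
Chapter 5 Risk-related Sentence Detection System for Financial Reports
5.1 Design Purpose
5.2 User Interface
5.3 Case Study
5.4 Summary
Chapter 6 Conclusion
References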
dc.format.extent 2770686 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G1047530352 en_US
dc.subject (Keywords) Text mining en_US
dc.subject (Keywords) Financial risk en_US
dc.subject (Keywords) Sentiment analysis en_US
dc.subject (Keywords) Machine learning en_US
dc.title (Title) 財報文字分析之句子風險程度偵測研究 zh_TW
dc.title (Title) Risk-related Sentence Detection in Financial Reports en_US
dc.type (Type) thesis en_US