Applying Latent Class Analysis on Text Mining (潛在類別分析於文字探勘之應用)

Title	Applying Latent Class Analysis on Text Mining (潛在類別分析於文字探勘之應用)
Author	Liaw, Yen-Ting (廖彥婷)
Contributors	江振東 (advisor); Liaw, Yen-Ting (廖彥婷)
Keywords	Classification; Latent class analysis; Similarity detection; Text mining
Date	2018
Upload time	3-Jul-2018 17:23:43 (UTC+8)
Abstract	Internet use is now mainstream, so websites hold a large amount of information in text form, and text mining has become a popular method of data analysis. Latent Class Analysis (LCA) is a technique often used in the social sciences to reveal the latent classes underlying a data set. In this thesis, we apply LCA to text mining and assess whether it is an appropriate method for that purpose. Two case studies are presented: the first is a similarity detection task that compares two novels, Water Margin and Romance of the Three Kingdoms; the second is a classification task on news articles, used to find important keywords. Conclusions and suggestions are provided.
Description	Master's; National Chengchi University; Department of Statistics; 105354029
Source	http://thesis.lib.nccu.edu.tw/record/#G0105354029
Type	thesis

dc.contributor.advisor	江振東	zh_TW
dc.contributor.author (Author)	廖彥婷	zh_TW
dc.contributor.author (Author)	Liaw, Yen-Ting	en_US
dc.creator (Author)	廖彥婷	zh_TW
dc.creator (Author)	Liaw, Yen-Ting	en_US
dc.date (Date)	2018	en_US
dc.date.accessioned	3-Jul-2018 17:23:43 (UTC+8)	-
dc.date.available	3-Jul-2018 17:23:43 (UTC+8)	-
dc.date.issued (Upload time)	3-Jul-2018 17:23:43 (UTC+8)	-
dc.identifier (Other identifier)	G0105354029	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/118220	-
dc.description (Description)	碩士 (Master's)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	105354029	zh_TW
dc.description.tableofcontents	Table Directory; Figure Directory; 1. Introduction; 2. Literature review; 2.1 Latent Class Analysis; 3. Case Study 1: Similarity Detection (Water Margin and Romance of the Three Kingdoms); 3.1 Data; 3.2 Data preprocessing; 3.3 Text mining methodology; 3.4 Results and discussion; 4. Case Study 2: Keyword Extraction (News Articles from chinatimes.com); 4.1 Data; 4.2 Data Preprocessing; 4.3 Data Analysis; 4.4 Result and Discussion; 5. Conclusion; Appendix 1: The selected word for feature candidates; Appendix 2: The selected word for 1-word/2-word/3-word keyword; Appendix 3: The selected keywords for each LCA; Appendix 4: Code for Case Study 1; Appendix 5: Code for Case Study 2; Reference; Mandarin Reference	zh_TW
dc.format.extent	1329542 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0105354029	en_US
dc.subject (Keywords)	分類 (Classification)	zh_TW
dc.subject (Keywords)	潛在類分析 (Latent class analysis)	zh_TW
dc.subject (Keywords)	文字探勘 (Text mining)	zh_TW
dc.subject (Keywords)	相似性檢測 (Similarity detection)	zh_TW
dc.subject (Keywords)	Classification	en_US
dc.subject (Keywords)	Latent class analysis	en_US
dc.subject (Keywords)	Similarity detection	en_US
dc.subject (Keywords)	Text mining	en_US
dc.title (Title)	潛在類別分析於文字探勘之應用	zh_TW
dc.title (Title)	Applying Latent Class Analysis on Text Mining	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/THE.NCCU.STAT.004.2018.B03	-