Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/125515
Title: Comparison and Application of Keyword Detection Methods (關鍵詞偵測方法的比較與應用)
The Application of Keywords Extraction
Author: 許承恩
Hsu, Cheng-En
Contributors: 余清祥; 鄭文惠
Yue, Ching-Syang; Cheng, Wen-Huei
許承恩
Hsu, Cheng-En
Keywords: Text mining (文字探勘)
Keyword extraction (關鍵字擷取)
Digital humanities (數位人文)
Machine learning (機器學習)
Term Frequency Inverse Document Frequency (詞頻與文本頻率)
Date: 2019
Uploaded: 5-Sep-2019
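Abstract: In recent years, the large-scale digitization of texts has made text mining a popular research area. A growing number of studies use quantitative techniques to uncover the meaning of texts, offering semantic readings that complement expert opinion from a different angle. Once a text has been structured (structurization), statistical and machine learning models are built for needs such as keyword extraction, latent topic discovery, sentiment analysis, and public opinion analysis. Keyword extraction in particular can be used to interpret an author's ideas, improve reading efficiency, characterize writing style, and trace changes in the time and setting in which articles were published. This study also takes keyword selection as its goal. It proposes an unsupervised statistical method and uses Chinese texts to evaluate the new method against several common keyword detection methods, including TF-IDF (Term Frequency Inverse Document Frequency), which is popular online, logistic regression from statistical analysis, and common machine learning models. The empirical analysis uses two sources of vernacular Chinese text, People's Daily (《人民日報》) and New Youth magazine (《新青年》): 514 People's Daily reports related to human rights from 1971-1989, and volumes 7 (1919) and 8 (1920) of New Youth. Each of these texts runs to roughly 400,000 to 600,000 characters. Humanities scholars first mark the keywords of each text, and these are treated as the reference answers. The three methods above are then applied to select candidate keywords, and their differences from expert opinion and their accuracy are compared. We also compare manually and automatically selected keywords and explore the possibility of combining the strengths of both approaches.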
Text mining has become a popular research area since IBM proposed the term Big Data in 2010. Since then, many texts have been digitized, and more scholars have devoted themselves to developing quantitative tools that assign semantic meaning to texts without the help of human experts. Provided that the texts are properly structured, this greatly increases the efficiency of reading a huge amount of text. The analysis of structured texts covers quite a few tasks, such as keyword extraction and sentiment analysis. Keyword extraction is critical: keywords can be used to summarize an article and to compare two authors' writing styles.
The goal of this study is to propose a new unsupervised method for extracting keywords and to compare it with some frequently used methods, including term frequency inverse document frequency (TF-IDF), logistic regression, and machine learning models. In the empirical analysis, we considered three modern Chinese texts, one from People's Daily (514 articles from 1971-1989) and two from New Youth Magazine (volumes 7 and 8, 1919-1920). Each text contains approximately 400,000 to 600,000 words. We asked humanities scholars to pick out keywords from these three texts and treated them as the true keywords. We then applied the different keyword extraction methods to these texts and compared their results. We found that the proposed method performs best among the unsupervised methods and is competitive with the supervised methods.
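The TF-IDF baseline and the comparison against expert-marked keywords mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `tfidf_keywords` and `precision`, the toy documents, and the "expert" labels are all hypothetical, and the documents are assumed to be already segmented into words (Chinese text would need a word segmentation step first).

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank terms by TF-IDF summed over all tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])  # 0 for terms in every document
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(top_k)]

def precision(predicted, gold):
    """Share of predicted keywords that the experts also marked."""
    if not predicted:
        return 0.0
    return sum(1 for term in predicted if term in gold) / len(predicted)

# Toy, made-up tokenized documents (not data from the thesis).
docs = [
    ["rights", "law", "the", "people"],
    ["rights", "rights", "the", "press"],
    ["youth", "the", "science", "youth"],
]
gold = {"rights", "youth"}  # hypothetical expert-marked keywords

top = tfidf_keywords(docs, top_k=2)
print(top, precision(top, gold))
```

In this toy corpus the word "the" appears in every document, so its inverse document frequency is zero and it never ranks as a keyword, which is the behavior TF-IDF is designed to give.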
Description: Master's thesis
National Chengchi University
Department of Statistics
106354020
Source: http://thesis.lib.nccu.edu.tw/record/#G0106354020
Type: thesis
Appears in Collections: Theses and Dissertations (學位論文)

Files in This Item:
File: 402001.pdf
Size: 2.88 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.