A Study of Chinese Keyword Extraction (中文關鍵詞偵測的探討) | Academic Output

Title	中文關鍵詞偵測的探討 (A Study of Chinese Keyword Extraction)
Author	林晏辰 (Lin, Yen-Chen)
Contributors	Advisors: 余清祥 (Yue, Ching-Syang); 鄭文惠 (Cheng, Wen-Huei). Author: 林晏辰 (Lin, Yen-Chen)
Keywords	Text Mining; Keyword Extraction; Spatial Statistics; Unsupervised Learning; Cross-Validation
Date	2020
Upload time	2 September 2020, 11:43:39 (UTC+8)
Abstract	In the era of big data and information overload, key information must be obtained quickly and efficiently. Search engines and similar retrieval tools can locate targets accurately when given suitable words (keywords). However, what counts as a keyword depends on the searcher's goal and on the domain of each text, so identifying keywords still mostly relies on expert opinion. Many recent studies use expert annotations to identify keywords by supervised learning, but the accuracy of such quantitative models is usually below 50%, which we suspect results from a lack of appropriate explanatory variables. This study aims to identify keywords with quantitative models, focusing on which variables are most strongly associated with keywords, and also proposes an unsupervised keyword selection method that does not depend on expert opinion. The candidate variables are term frequency and document frequency (TF-IDF, Term Frequency-Inverse Document Frequency), which has often been used in past studies, together with the chi-square test statistic, RAKE (Rapid Automatic Keyword Extraction), and the Gini index. The empirical analysis uses articles from New Youth (《新青年》), People's Daily (《人民日報》), The Liberty Times (《自由時報》), and Apple Daily (《蘋果日報》), all written in modern vernacular Chinese. Humanities scholars first mark the keywords, which are treated as the ground truth, and cross-validation is then used to compare the supervised models: logistic regression, classification tree, random forest, neural network, support vector machine, and extreme gradient boosting. For the setting without expert input, we combine ideas from spatial analysis and sequential analysis to build an unsupervised keyword detection method and compare its accuracy with that of the supervised models. The results show that appropriate variables play an important role in keyword detection. The supervised models perform better, with F1 scores about 20% higher than the unsupervised model, while the unsupervised model still reaches an F1 score of about 45%.
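As a rough illustration of the kind of measurement the abstract describes, the Python sketch below scores already-segmented words by TF-IDF, takes the top-ranked words as predicted keywords, and computes the F1 score against a set of expert-marked keywords. It is a minimal sketch under stated assumptions, not the thesis's implementation: the toy documents, the gold keyword set, the function names (`tf_idf`, `top_k`, `f1_score`), and the particular TF-IDF variant (relative frequency times log(N/df)) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists (texts are assumed to be segmented already).
    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_k(score, k):
    """Pick the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall for two keyword sets."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus and expert-marked keywords for the first document.
docs = [
    ["keyword", "extraction", "keyword", "text"],
    ["text", "mining", "text", "model"],
    ["model", "learning", "text", "model"],
]
gold = {"keyword", "text"}

scores = tf_idf(docs)
predicted = top_k(scores[0], 2)
print(sorted(predicted), f1_score(predicted, gold))
```

In this toy corpus "text" appears in every document, so its IDF is zero and it is never selected even though the hypothetical expert marked it. That is one way to see why a single variable such as TF-IDF may miss keywords and why the study tests additional variables.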
Description	Master's thesis, National Chengchi University, Department of Statistics, 107354026
Source	http://thesis.lib.nccu.edu.tw/record/#G0107354026
Type	thesis

dc.contributor.advisor	余清祥; 鄭文惠	zh_TW
dc.contributor.advisor	Yue, Ching-Syang; Cheng, Wen-Huei	en_US
dc.contributor.author (Author)	林晏辰	zh_TW
dc.contributor.author (Author)	Lin, Yen-Chen	en_US
dc.creator (Author)	林晏辰	zh_TW
dc.creator (Author)	Lin, Yen-Chen	en_US
dc.date (Date)	2020	en_US
dc.date.accessioned	2-Sep-2020 11:43:39 (UTC+8)	-
dc.date.available	2-Sep-2020 11:43:39 (UTC+8)	-
dc.date.issued (Upload time)	2-Sep-2020 11:43:39 (UTC+8)	-
dc.identifier (Other identifier)	G0107354026	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/131480	-
dc.description (Description)	Master's	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	Department of Statistics	zh_TW
dc.description (Description)	107354026	zh_TW
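dc.description.tableofcontents	Chapter 1 Introduction: 1. Research Motivation; 2. Research Objectives. Chapter 2 Literature Review: 1. Introduction to the Texts; 2. TextRank; 3. Jieba Word Segmentation; 4. Variable Selection; 5. Data Structuring. Chapter 3 Supervised Learning: 1. Introduction to Statistical and Machine Learning Methods; 2. Overview of the Simulation; 3. Single Variable; 4. Two Variables; 5. Multiple Variables; 6. Selection of Significant Variables; 7. Comparison of Texts and Models. Chapter 4 Unsupervised Learning: 1. Dimensionality Reduction; 2. Keyword Selection; 3. Parameter Settings; 4. Simulation Results; 5. Differences in Keyword Selection. Chapter 5 Conclusions and Suggestions: 1. Conclusions; 2. Research Limitations and Future Suggestions. References. Appendices: Appendix 1 Keywords of Each Text; Appendix 2 Model Performance Evaluation.	zh_TW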
dc.format.extent	2839839 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0107354026	en_US
dc.subject (Keywords)	文字探勘 (Text Mining)	zh_TW
dc.subject (Keywords)	關鍵詞偵測 (Keyword Extraction)	zh_TW
dc.subject (Keywords)	空間統計 (Spatial Statistics)	zh_TW
dc.subject (Keywords)	非監督學習 (Unsupervised Learning)	zh_TW
dc.subject (Keywords)	交叉驗證 (Cross-Validation)	zh_TW
dc.subject (Keywords)	Text Mining	en_US
dc.subject (Keywords)	Keyword Extraction	en_US
dc.subject (Keywords)	Spatial Statistics	en_US
dc.subject (Keywords)	Unsupervised Learning	en_US
dc.subject (Keywords)	Cross-Validation	en_US
dc.title (Title)	中文關鍵詞偵測的探討	zh_TW
dc.title (Title)	A Study of Chinese Keyword Extraction	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202001387	en_US
