Title: 一個對單篇中文文章擷取關鍵字之演算法 (A Keyword Extraction Algorithm for Single Chinese Document)
Author: Wu, Tai Hsun (吳泰勳)
Advisor: Hsu, Kuo Wei (徐國偉)
Keywords: Keyword Extraction (關鍵字擷取); single Chinese document (單篇中文文章)
Date: 2013
Uploaded: 2-Jan-2014 14:07:20 (UTC+8)

Abstract:
Over the past 14 years, the Taiwan e-Learning and Digital Archives Program has built digital archives covering 15 topics, including organisms, archaeology, and geology. The goal of the work presented in this thesis is to automatically extract keywords from documents in the digital archives; the techniques developed along the way can be used to build a connection between the archives and news articles. Because news articles constantly introduce new words and new uses of existing words, we propose an algorithm that extracts keywords from a single Chinese document without using a corpus or dictionary. Given a Chinese document, the algorithm first divides it into bigrams of Chinese characters, so the smallest unit of a term is two characters (e.g., 「中文」). Next, it calculates the term frequency of each bigram and filters out those with low frequencies. Finally, it calculates chi-square values to produce the keywords most related to the topic of the document. The distribution of word co-occurrence is an important signal: if a term's co-occurrence distribution over the frequent terms deviates strongly from what independence would predict, the term is likely a keyword. Unlike English, where segmentation can rely on explicit word delimiters, Chinese word segmentation is challenging because there are no spaces between characters. The proposed algorithm segments words with the bigram-based approach, and we compare the segmented words with those given by CKIP and the Stanford Chinese Segmenter. We present comparisons across different settings: whether or not infrequent terms are filtered out, whether or not frequent terms are grouped by a clustering algorithm, and whether candidate keywords are ranked by chi-square value or by term frequency.
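The pipeline described above (bigram segmentation, frequency filtering, chi-square scoring against the frequent terms) can be sketched as follows. This is only an illustrative reconstruction, not the thesis's actual code: the clause splitting, the size of the frequent-term set, and the exact chi-square formulation are all assumptions.

```python
import re
from collections import Counter

def extract_keywords(text, top_k=5, n_frequent=10):
    # 1. Split the document into clauses at punctuation, then segment each
    #    clause into overlapping character bigrams (smallest unit: 2 chars).
    clauses = [c for c in re.split(r"[,。、;:!?,.;:\s]+", text) if len(c) >= 2]
    grams = [set(c[i:i + 2] for i in range(len(c) - 1)) for c in clauses]
    tf = Counter(g for s in grams for g in s)  # clause-level term frequency

    # 2. The frequent-term set G: the top-n bigrams by frequency.
    G = [t for t, _ in tf.most_common(n_frequent)]

    # 3. Chi-square of each term against G: a term whose co-occurrence with
    #    the frequent terms deviates strongly from the independence
    #    expectation is likely a topic keyword.
    n_clauses = len(grams)
    scores = {}
    for w in tf:
        chi2 = 0.0
        for g in G:
            if g == w:
                continue
            expected = tf[w] * tf[g] / n_clauses  # if w and g were independent
            observed = sum(1 for s in grams if w in s and g in s)
            chi2 += (observed - expected) ** 2 / expected
        scores[w] = chi2
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Terms with high scores co-occur with a few frequent terms far more often than chance would predict, which is the deviation the abstract describes.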
The dataset used in the experiments is downloaded from the Academia Sinica Digital Resources website, and the ground truth is provided by Gainwisdom (撈智網), developed by the Computer Systems and Communication Lab at the Institute of Information Science, Academia Sinica. According to the experimental results, some of the topic keywords produced with the bigram-based approach are the same as those produced with CKIP or the Stanford Chinese Segmenter, while others have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it requires no corpus or dictionary. Finally, the proposed algorithm was developed with the aim of promoting the digital archives: we hope it can extract topic keywords from articles on currently popular topics and link those keywords to related archive materials, driving a new wave of interest in the digital archives.
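The comparison against the Gainwisdom ground truth presumably reports precision and recall over keyword sets; a minimal sketch of that measurement (the function name and inputs are hypothetical):

```python
def precision_recall(extracted, ground_truth):
    """Precision and recall of extracted keywords against a ground-truth set."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    hits = len(extracted & ground_truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, extracting 3 keywords of which 2 appear among 4 ground-truth keywords gives precision 2/3 and recall 1/2.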
Degree: Master's
Institution: National Chengchi University (國立政治大學)
Department: Department of Computer Science (資訊科學學系)
Student ID: 100971017
Academic year: 102
Identifier: G0100971017
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/63217
Source: http://thesis.lib.nccu.edu.tw/record/#G0100971017
Type: thesis
Table of Contents:
Chapter 1  Introduction
  1.1  Background
  1.2  Motivation
  1.3  Objectives
  1.4  Thesis Organization
Chapter 2  Literature Review
  2.1  Chinese Word Segmentation
    2.1.1  n-gram
    2.1.2  Stanford Chinese Segmenter
    2.1.3  CKIP
  2.2  Keyword Extraction
    2.2.1  English Keyword Extraction
    2.2.2  Chinese Keyword Extraction
    2.2.3  Summary
  2.3  Word Co-occurrence
Chapter 3  The Algorithm
  3.1  Word Segmentation
  3.2  Clustering
  3.3  Chi-square Calculation
  3.4  The Algorithm
Chapter 4  Experimental Design
  4.1  Tools
  4.2  Dataset
  4.3  Procedure
  4.4  Results
    4.4.1  Comparison Group 1
    4.4.2  Comparison Group 2
    4.4.3  Comparison Group 3
    4.4.4  Comparison Group 4
    4.4.5  Comparison Group 5
    4.4.6  Comparison Group 6
  4.5  Evaluation
Chapter 5  Conclusions and Future Work
References
Appendix
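Section 3.2 clusters the frequent terms before the chi-square step, but this record does not say which clustering method the thesis uses. One plausible sketch, offered purely as an assumption, greedily groups frequent terms whose clause-level co-occurrence footprints overlap (Jaccard similarity over the clauses each term appears in):

```python
def cluster_frequent_terms(clause_sets, frequent, threshold=0.5):
    # Which clauses each frequent term appears in.
    appears = {t: {i for i, s in enumerate(clause_sets) if t in s} for t in frequent}
    clusters = []  # each cluster is a list of terms; first term is the representative
    for t in frequent:
        for c in clusters:
            rep = c[0]
            union = appears[t] | appears[rep]
            if union and len(appears[t] & appears[rep]) / len(union) >= threshold:
                c.append(t)  # t co-occurs like the representative: same cluster
                break
        else:
            clusters.append([t])  # no similar cluster found: start a new one
    return clusters
```

Grouping frequent terms that behave alike keeps near-synonymous terms from being counted as independent columns in the chi-square step.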