Title 中文文本探勘工具:主題分析、詞組關聯強度、相關句擷取
Tools for Chinese Text Mining: Topic Analysis, Association Strengths of Collocations, Extraction of Relevant Statements
Author 林書佑 (Lin, Shu Yu)
Advisor 劉昭麟 (Liu, Chao Lin)
Keywords 文本探勘 (Text Mining); 主題分析 (Topic Analysis); 詞組關聯強度 (Association Strengths of Collocations); 相關句擷取 (Extraction of Relevant Statements)
Date 2016
Uploaded 2-May-2016 13:55:23 (UTC+8)
Abstract In an era in which documents are being digitized rapidly and in large volumes, research fields of all kinds rely increasingly on text mining techniques. In the digital humanities, the topic has drawn growing attention since the International Conference of Digital Archives and Digital Humanities began in 2009; the central aim is to combine digitized cultural materials with computational analysis and visualization, so that interpretation at different levels can build a more complete picture of those materials.
This study constructs a set of tools for analyzing Chinese corpora. The tools compute association strengths between key terms with latent semantic analysis, pointwise mutual information, Pearson's chi-squared test, typed dependencies distance, word2vec, and Gibbs sampling for latent Dirichlet allocation; they combine these results with clustering methods to identify possible topics, and finally extract the sentences that match the clustering results to support analysis and interpretation by humanities scholars. By offering multiple perspectives on a corpus, the tools aim to make corpus-based research more efficient.
We used the People's Daily, New Youth, the United Daily News, and the China Times as Chinese corpora for experiments and testing. The results of analyzing New Youth with the tools were provided to humanities scholars as reference information and supporting evidence for their interpretation, and a paper on this work was presented at the 2015 International Conference of Digital Archives and Digital Humanities. We are currently evaluating the tools' effectiveness on a variety of Chinese corpora, and in the future we will release them publicly so that more scholars can save time on corpus analysis.
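Among the association measures the abstract names, pointwise mutual information is the simplest to sketch. The following is an illustrative toy implementation only, not the thesis's actual code; the segmented toy corpus and function name are assumptions:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Score word-pair association by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from sentence-level co-occurrence counts."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        unique = set(words)
        word_counts.update(unique)
        pair_counts.update(frozenset(p) for p in combinations(sorted(unique), 2))
    scores = {}
    for pair, c in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy segmented corpus: each inner list holds one sentence's tokens.
corpus = [
    ["民主", "科學", "青年"],
    ["民主", "科學"],
    ["青年", "文學"],
    ["民主", "革命"],
]
scores = pmi_scores(corpus)
```

A positive score means two terms co-occur more often than chance would predict; the tools described above would feed such scores into clustering to group related keywords.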
References
[1] 人民日報,http://paper.people.com.cn/。
[2] 中國近現代思想及文學史專業數據庫文獻簡介,http://digibase.ssic.nccu.edu.tw/?m=2302&wsn=0300。
[3] 台灣數位人文小小讚,https://sites.google.com/site/taiwandigitalhumanities/。
[4] 金觀濤。數位人文研究的理論基礎,數位人文研究的新視野:基礎與想像,項潔編,45-61,臺灣大學出版中心,臺灣,2011。
[5] 金觀濤、邱偉雲、梁穎誼、陳柏聿、沈錳坤、及劉青峰。觀念群變化的數位人文研究-以《新青年》為例,2014第五屆數位典藏與數位人文國際研討會,臺灣,2014。
[6] 金觀濤、邱偉雲、及劉昭麟。「共現」詞頻分析及其運用─以「華人」觀念起源為例,2011年第三屆數位典藏與數位人文國際研討會論文集,199-223,臺灣,2011。
[7] 項潔、翁稷安。導論―關於數位人文的思考:理論與方法,數位人文研究的新視野:基礎與想像,項潔編,臺灣大學出版中心,9-18,臺灣,2011。
[8] 新青年簡介,http://zh.wikipedia.org/zh-tw/新青年。
[9] 劉昭麟、金觀濤、劉青峰、邱偉雲、及姚育松。自然語言處理技術於中文史學文獻分析之初步應用,2011第三屆數位典藏與數位人文國際研討會論文集,151-168,臺灣,2011。
[10] John Aldrich. R.A. Fisher and the making of maximum likelihood 1912-1922, Statistical Science, 162-176, 1997.
[11] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 993–1022, 2003.
[12] Lee-Feng Chien. PAT-tree-based adaptive keyphrase extraction for intelligent Chinese information retrieval, Information Processing and Management, 501-521, 1999.
[13] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Computational Linguistics, 22-29, 1990.
[14] Garry A. Einicke. Smoothing, Filtering and Prediction: Estimating the Past, Present and Future, InTech, 2012.
[15] George William Furnas, Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis, Journal of the American Society for Information Science, 391-407, 1990.
[16] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
[17] Trevor John Hastie, Robert Tibshirani. Generalized Additive Models, Chapman & Hall/CRC, 1990.
[18] JAMA, http://math.nist.gov/javanumerics/jama/.
[19] Leonard Kaufman, Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 2005.
[20] Chao-Lin Liu, Guantao Jin, Qingfeng Liu, Wei-Yun Chiu, and Yih-Soong Yu. Some chances and challenges in applying language technologies to historical studies in Chinese, International Journal of Computational Linguistics and Chinese Language Processing, 27-46, 2011.
[21] Yang Liu, Minghui Qiu, Swapna Gottipati, Feida Zhu, Jing Jiang, Huiping Sun, and Zhong Chen. CQARank: Jointly Model Topics and Expertise in Community Question Answering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, USA, 2013.
[22] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, In Proceedings of Workshop at ICLR, 2013.
[24] PAT Tree, http://www.openfoundry.org/of/projects/367/.
[25] Karl Pearson, http://en.wikipedia.org/wiki/Karl_Pearson.
[26] SRI Language Modeling (SRILM), http://www.speech.sri.com/projects/srilm/.
[27] Stanford Part-Of-Speech Tagger, http://nlp.stanford.edu/software/tagger.shtml.
[28] Stanford Typed Dependencies, http://nlp.stanford.edu/software/lex-parser.shtml.
[29] Stanford Word Segmenter, http://nlp.stanford.edu/software/segmenter.shtml.
[30] Lloyd N. Trefethen, David Bau, III. Numerical Linear Algebra, SIAM, 1997.
[31] WEKA , http://www.cs.waikato.ac.nz/ml/weka/.
[32] Xiao-guang Wang, Mitsuyuki Inaba. Structure and evolution of digital humanities: empirical research based on correspondence and co-word analyses, 從保存到創造:開啟數位人文研究,97-112,臺北:國立臺灣大學出版中心,2011。
Description Master's thesis
National Chengchi University
Department of Computer Science
102753020
Source http://thesis.lib.nccu.edu.tw/record/#G0102753020
Type thesis
URI http://nccur.lib.nccu.edu.tw/handle/140.119/89066
Table of Contents
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Research Objectives
1.3 Main Contributions
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Research on the Digital Humanities
2.2 Research on Data Analysis in the Digital Humanities
Chapter 3 Methodology
3.1 Experimental Corpora
3.2 Corpus Preprocessing
3.2.1 Sentence Extraction
3.2.2 Word Segmentation
3.2.3 Part-of-Speech Tagging
3.2.4 Word Frequencies and Keyword Selection
3.2.5 Computing Frequencies of Co-occurring Words
3.3 Stanford Research Institute Language Modeling (SRILM)
3.4 Association Rules
3.4.1 Apriori Algorithm
3.5 Computing Association Strengths of Keywords
3.5.1 Means and Standard Deviations of Distances between Co-occurring Words
3.5.2 Latent Semantic Analysis
3.5.3 Pointwise Mutual Information
3.5.4 Pearson's Chi-squared Test
3.5.5 Typed Dependencies Distance
3.6 Gibbs Sampling for Latent Dirichlet Allocation
3.7 Word2vec
3.7.1 Skip-gram Model
3.7.2 Continuous Bag-of-Words Model
3.8 Cluster Analysis
3.8.1 Jaccard Coefficient
3.8.2 Hierarchical Clustering
3.8.3 K-means Clustering
Chapter 4 Features and Interfaces of the Keyword Analysis Tools
4.1 Main Screen for Keyword Analysis
4.2 Keyword Selection Interface
4.3 Interface for Means, Standard Deviations, and Frequencies of Distances between Words
4.4 Pearson's Chi-squared Test Interface
4.5 Pointwise Mutual Information Interface
4.6 Typed Dependencies Distance Interface
4.7 Latent Semantic Analysis Interface
4.8 Interface for Clustering and Extracting Relevant Statements
4.9 Latent Dirichlet Allocation Interface
4.9.1 Running LDA
4.9.2 Finding Statements Relevant to LDA Results
4.10 Word2vec Interface
4.10.1 Running Word2vec
4.10.2 Extracting Statements Relevant to Word2vec Results
Chapter 5 Experimental Results and Evaluation
5.1 Evaluation on New Youth
5.1.1 Scholars' Findings and Tool Evaluation
5.1.2 Analysis of Experimental Results
5.2 Evaluation on People's Daily
5.2.1 Method for Comparing Experimental Clusters with Expert Clusters
5.2.2 Results and Relevant Statements for Types 1-6 by Year
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
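The outline above lists the Jaccard coefficient (section 3.8.1) among the clustering components. As an illustration of that measure only, under assumed toy data and function names rather than the thesis's own code, keywords can be compared by the overlap of the sentence sets in which they occur:

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy example: each keyword mapped to the set of sentence ids containing it;
# keywords with a high Jaccard coefficient are candidates for the same cluster.
occurrences = {
    "民主": {1, 2, 4},
    "科學": {1, 2},
    "文學": {3},
}
sim = jaccard(occurrences["民主"], occurrences["科學"])  # 2 shared of 3 total ids
```

A hierarchical clustering, as in section 3.8.2, could then merge the most similar keyword pairs first, using 1 minus this coefficient as the distance.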
Format application/pdf, 3,973,168 bytes