關鍵詞與階層式詞彙文本分群之應用 (The Application of Key Words and Hierarchical Vocabulary Text Grouping)

Title	關鍵詞與階層式詞彙文本分群之應用 (The Application of Key Words and Hierarchical Vocabulary Text Grouping)
Author	黃培軒 (Huang, Pei-Hsuan)
Contributors	Advisors: 余清祥 (Yue, Ching-Syang); 宋皇志 (Sung, Huang-Chin). Author: 黃培軒 (Huang, Pei-Hsuan)
Keywords	Hierarchical vocabulary text grouping; Keywords; Digital humanities; Semantic analysis; Data driven
Date	2018
Upload time	27-Jul-2018 11:33:35 (UTC+8)
Abstract	Text is a carrier of human history. From dynastic histories to personal diaries, it records the culture, thought, customs, and technological development of its time. Such records are no longer confined to physical media such as kraft paper, clay tiles, or bamboo slips; they are increasingly stored in varied digital forms online. Interpreting the central ideas of a text has traditionally required experts. With the rise of text analysis, more and more scholars are developing quantitative techniques to uncover the meaning embedded in text, both to filter information quickly in an age of information overload and to offer a perspective beyond that of experts. Many people think that computer technology, such as machine learning and artificial intelligence, can lighten the burden on human experts in seeking the meaning of a text. Topic analysis is an important research topic in text analysis: defining keywords and separating text attributes makes text parsing more accurate and efficient. Building on the commonly used TF-IDF (term frequency-inverse document frequency) and on WordNet, a common tool for handling semantics, this thesis proposes applications of core vocabulary and label-feature screening, and examines the instability caused by article length and the problem of vocabulary relations in specialized domains (Magnini and Cavaglia, 2000). Three corpora are analyzed, the Taiwan Social Science Citation Index (TSSCI), U.S. patents, and the People's Daily, to construct their semantic relations and related applications.
The classification accuracy for the TSSCI and U.S. patent texts is nearly 80%. However, when a corpus contains too few articles, the noise is too strong for semantic relations to emerge. When text labels differ in writing style rather than in topic, the proposed topic classification cannot classify accurately, which may explain the poor classification accuracy for the People's Daily; even so, the features of each label still reveal the special topics of that period.
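The abstract names TF-IDF as the basis for keyword screening. As a minimal sketch only (not the thesis's implementation), the weighting can be illustrated as follows, assuming relative term frequency and an unsmoothed logarithmic IDF. The function names `tf_idf` and `top_keywords` and the toy corpus are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Relative frequency (count / total) keeps long and short documents
        # on a comparable scale; log(N / df) down-weights common terms.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, already tokenized.
docs = [
    "patent claim device sensor device".split(),
    "survey policy society survey".split(),
    "patent policy".split(),
]
weights = tf_idf(docs)
print(top_keywords(weights[0], k=2))
```

A term that occurs in every document receives weight zero under this definition. With very few or very short documents, a single occurrence can shift the weights substantially, which is one way to see the instability the abstract mentions.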
References
何立行、余清祥、鄭文惠 (2014). 從文言到白話:《新青年》雜誌語言變化統計研究 [From classical to vernacular Chinese: a statistical study of language change in the magazine New Youth]. 東亞觀念史集刊, 7, 427-454.
余清祥 (1998). 統計在紅樓夢的應用 [Applications of statistics to Dream of the Red Chamber]. 政大學報, 76, 303-327.
吳旻璁 (2013). 結合主題資訊萃取關鍵詞和建構概念圖 [Extracting keywords and constructing concept maps with topic information]. Master's thesis, Graduate Institute of Information Management, National Yunlin University of Science and Technology.
吳怡瑾、方友杉、喻欣凱 (2009). 運用文件分群與概念關聯分析技術協助網誌瀏覽:任務導向評估方法 [Using document clustering and concept association analysis to support blog browsing: a task-oriented evaluation method]. 圖書資訊學研究, 4(1), 133-164.
梁家安 (2016). 從國共內戰到改革開放:人民日報風格變遷之量化研究 [From the Chinese Civil War to Reform and Opening-up: a quantitative study of style change in the People's Daily]. Master's thesis, Graduate Institute of Statistics, National Chengchi University.
謝博行 (2013). 局部最長連續共同子序列與新詞組收集 [Local longest contiguous common subsequences and the collection of new phrases]. Master's thesis, Institute of Statistics, National Tsing Hua University.
Beliga, S., Meštrović, A., & Martinčić-Ipšić, S. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences, 39(1), 1-20.
Benezeth, Y., Bertaux, A., & Manceau, A. (2015). Bag-of-word based brand recognition using Markov clustering algorithm for codebook generation. 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, 3315-3318.
Berry, M. W. (2004). Survey of Text Mining: Clustering, Classification, and Retrieval. Springer.
Chen, C. H. (2017). Improved TF.IDF in Big News Retrieval: An Empirical Study. Pattern Recognition Letters, 93, 113-122.
Condon, A., & Karp, R. M. (2001). Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2), 116-140.
Donetti, L., & Munoz, M. A. (2004). Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics, 2004(10), 10012.
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological networks. Proc. Natl Acad. Sci. USA, 99, 7821-7826.
Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proc. of the SIGIR 2003 Semantic Web Workshop, 541-544.
Huang, A. (2008). Similarity Measures for Text Document Clustering. NZCSRSC 2008, Christchurch, New Zealand.
Inmon, W. H., & Nesavich, A. (2008). Tapping Into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence. Prentice Hall.
Lan, M., Tan, C. L., Low, H. B., & Sung, S. Y. (2005). A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. 14th WWW, 1032-1033.
Magnini, B., & Cavaglia, G. (2000). Integrating subject field codes into wordnet. In Proceedings of LREC-2000, the Second International Conference on Language Resources and Evaluation. Athens, Greece.
Newman, M. E. J. (2004). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.
Passos, A., & Wainer, J. (2009). Wordnet-based metrics do not seem to help document clustering.
Pons, P., & Latapy, M. (2006). Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10(2).
Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6), 563-579.
Salton, G., & Yu, C. T. (1975). On the construction of effective vocabularies for information retrieval. ACM Sigplan Notices, 9(3), 48-60.
Shraddha K. P., Pramod B. D., & Vishakha A. M. (2017). Hierarchical document clustering based on cosine similarity measure.
Description	Master's thesis, Department of Statistics, National Chengchi University; 105354027
Source	http://thesis.lib.nccu.edu.tw/record/#G0105354027
Type	thesis

dc.contributor.advisor	余清祥; 宋皇志	zh_TW
dc.contributor.advisor	Yue, Ching-Syang; Sung, Huang-Chin	en_US
dc.contributor.author (Authors)	黃培軒	zh_TW
dc.contributor.author (Authors)	Huang, Pei-Hsuan	en_US
dc.creator (Author)	黃培軒	zh_TW
dc.creator (Author)	Huang, Pei-Hsuan	en_US
dc.date (Date)	2018	en_US
dc.date.accessioned	27-Jul-2018 11:33:35 (UTC+8)	-
dc.date.available	27-Jul-2018 11:33:35 (UTC+8)	-
dc.date.issued (Upload time)	27-Jul-2018 11:33:35 (UTC+8)	-
dc.identifier (Other Identifiers)	G0105354027	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/118935	-
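dc.description (Description)	碩士 (Master's)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	105354027	zh_TW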
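dc.description.tableofcontents
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Research Objectives
Chapter 2 Literature Review
  2.1 Jieba Word Segmentation and Stemming
  2.2 TF-IDF
  2.3 WordNet
  2.4 Random Walk Model and Community Networks
Chapter 3 Methodology
  3.1 Database Construction
  3.2 Keyword Screening
  3.3 Community Network Clustering and Naming
  3.4 Text Classification and Label Features
Chapter 4 Data
  4.1 Taiwan Social Science Citation Index
  4.2 U.S. Patents
  4.3 People's Daily
Chapter 5 Results
  5.1 TSSCI Keyword Screening Thresholds and Model
  5.2 Empirical Analysis of Hierarchical Vocabulary Text Grouping
    5.2.1 Hierarchical Vocabulary Grouping and Naming
    5.2.2 Text Classification Accuracy and Feature Selection
Chapter 6 Conclusions and Recommendations
  6.1 Conclusions
  6.2 Limitations and Future Work
References (Chinese; English)
Appendix 1 Keywords of Each Corpus and Their Groupings
Appendix 2 Supplementary People's Daily Texts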
dc.format.extent	1928532 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0105354027	en_US
dc.subject (Keywords)	階層式詞彙文本分群	zh_TW
dc.subject (Keywords)	關鍵詞	zh_TW
dc.subject (Keywords)	數位人文	zh_TW
dc.subject (Keywords)	語意分析	zh_TW
dc.subject (Keywords)	資料導向	zh_TW
dc.subject (Keywords)	Hierarchical vocabulary text grouping	en_US
dc.subject (Keywords)	Keywords	en_US
dc.subject (Keywords)	Digital humanities	en_US
dc.subject (Keywords)	Semantic analysis	en_US
dc.subject (Keywords)	Data driven	en_US
dc.title (Title)	關鍵詞與階層式詞彙文本分群之應用	zh_TW
dc.title (Title)	The Application of Key Words and Hierarchical Vocabulary Text Grouping	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/THE.NCCU.STAT.011.2018.B03	-
