SeqWORDS: an unsupervised Chinese segmentation method using relationship of two consecutive words (基於兩詞彙的序列關係建造非監督式 SeqWORDS 斷詞方法)

Title	基於兩詞彙的序列關係建造非監督式 SeqWORDS 斷詞方法 / SeqWORDS: an unsupervised Chinese segmentation method using relationship of two consecutive words
Author	吳冠輝 (Wu, Guan-Hui)
Contributor	薛慧敏 (Hsueh, Huey-Miin); 吳冠輝 (Wu, Guan-Hui)
Keywords	Chinese word segmentation; Chinese text mining; Dynamic programming; Word dictionary model; EM algorithm; Word dependency (sequential relationship between words)
Date	2019
Upload time	1-Jul-2019 10:43:44 (UTC+8)
Abstract	Unlike alphabet-based languages, Chinese text has no spaces or other markers between words, so word segmentation is regarded as a necessary and important preprocessing step in Chinese text mining. Most existing segmentation methods are supervised in the sense that they require an adequate dictionary, and they perform poorly when no suitable dictionary is available, for example on contemporary writing or on texts from a specialized domain. In 2016, Deng et al. proposed an unsupervised method, TopWORDS, which needs no dictionary in hand. It builds the likelihood function of the corpus from the word dictionary model (WDM), treats the unknown segmentation as missing data, and uses the EM algorithm to estimate the usage probability of each word. The estimates are computed by dynamic programming, which makes the method computationally efficient, and TopWORDS was reported to perform well on several corpora. However, TopWORDS assumes that the words at different positions in a text are independent and identically distributed (iid), which ignores the dependency that frequently occurs between consecutive words. This research instead assumes that the probability of each word depends on the preceding word, so that the likelihood of the corpus becomes a function of the sequential relationship between two consecutive words. The proposed method is therefore named SeqWORDS. SeqWORDS and two existing methods are evaluated on the classical novel Story-of-Stone (紅樓夢). We find that SeqWORDS is less capable of discovering new, rare words and is considerably more time-consuming. However, when further text mining tools such as word-vector analysis are applied to the segmented corpus, the segmentation produced by SeqWORDS gives the most reasonable and interpretable results.
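The difference between the iid assumption and the consecutive-word assumption described in the abstract can be illustrated with a small forward dynamic-programming sketch. It is not the thesis's implementation: the function names, the `START` marker, the `max_len` cap on word length, and the toy probabilities are all assumptions made for illustration. The first function sums over every segmentation of a string when words are drawn independently from a dictionary, and the second does the same when each word's probability depends on the preceding word.

```python
from collections import defaultdict

START = "<s>"  # hypothetical marker for "no previous word"


def unigram_likelihood(text, theta, max_len=4):
    """Sum over all segmentations of `text` when words are iid.

    theta[word] is the usage probability of `word` (assumed given here;
    an EM procedure would estimate it).
    """
    n = len(text)
    fwd = [0.0] * (n + 1)  # fwd[i] = probability of generating text[:i]
    fwd[0] = 1.0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            word = text[i - k:i]
            fwd[i] += fwd[i - k] * theta.get(word, 0.0)
    return fwd[n]


def bigram_likelihood(text, trans, max_len=4):
    """Sum over all segmentations when each word depends on the previous one.

    trans[prev][word] = P(word | prev); trans[START] holds first-word
    probabilities. The state must remember the length of the last word,
    which is why this version costs more than the iid one.
    """
    n = len(text)
    # fwd[i][k] = probability of text[:i] whose last word has length k
    fwd = [defaultdict(float) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            word = text[i - k:i]
            begin = i - k
            if begin == 0:
                fwd[i][k] = trans.get(START, {}).get(word, 0.0)
                continue
            total = 0.0
            for j, p_prev in fwd[begin].items():
                prev = text[begin - j:begin]
                total += p_prev * trans.get(prev, {}).get(word, 0.0)
            fwd[i][k] = total
    return sum(fwd[n].values())


# Toy dictionary over a two-letter alphabet (illustrative values only).
theta = {"a": 0.3, "b": 0.3, "ab": 0.4}
iid_value = unigram_likelihood("abab", theta)

# With identical rows, the bigram model reduces to the iid model.
flat = {prev: theta for prev in (START, "a", "b", "ab")}
seq_value = bigram_likelihood("abab", flat)
```

When every row of `trans` is the same distribution, the two functions agree, which shows the iid model as a special case of the consecutive-word model. The extra loop over the previous word's length in `bigram_likelihood` is one plausible reason a method of this kind would take longer to run.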
References	[1] The Stanford Natural Language Processing Group, Chinese Natural Language Processing and Speech Processing. Retrieved May 24, 2019, from https://nlp.stanford.edu/projects/chinese-nlp.shtml#cws
[2] J. Lafferty, A. McCallum, F. C. N. Pereira, (2001), Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pp. 282–289.
[3] fxsjy, Jieba. Retrieved May 27, 2019, from https://github.com/fxsjy/jieba
[4] L. R. Rabiner, B. H. Juang, (1986), An introduction to hidden Markov models. IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16.
[5] A. Chen, (2003), Chinese word segmentation using minimal linguistic knowledge. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing (SIGHAN '03), vol. 17, pp. 148–151.
[6] K. J. Chen, S. H. Liu, (1992), Word identification for Mandarin Chinese sentences. Proceedings of the 14th Conference on Computational Linguistics (COLING '92), vol. 1, pp. 101–107.
[7] K. Deng, P. K. Bol, K. J. Li, J. S. Liu, (2016), On the unsupervised analysis of domain-specific Chinese texts. Proceedings of the National Academy of Sciences of the United States of America, vol. 113, pp. 6154–6159.
[8] X. Ge, W. Pratt, P. Smyth, (1999), Discovering Chinese words from unsegmented text. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 271–272.
[9] A. P. Dempster, N. M. Laird, D. B. Rubin, (1977), Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38.
[10] R. Bellman, (1954), The theory of dynamic programming. Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515.
[11] X. Cao, Story-of-Stone.
[12] Hu Shih (胡適), (1988), 胡適紅樓夢研究論述全編 [Complete Collection of Hu Shih's Studies on Story-of-Stone], Shanghai Classics Publishing House (上海古籍出版社).
[13] T. Mikolov, K. Chen, G. Corrado, J. Dean, (2013), Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3.
[14] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, (2013), Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013, pp. 3111–3119.
[15] K. Pearson, (1901), On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, vol. 2, pp. 559–572.
Description	Master's thesis; National Chengchi University (國立政治大學); Department of Statistics (統計學系); 106354027
Source	http://thesis.lib.nccu.edu.tw/record/#G0106354027
Type	thesis

dc.contributor.advisor	薛慧敏	zh_TW
dc.contributor.advisor	Hsueh, Huey-Miin	en_US
dc.contributor.author (Authors)	吳冠輝	zh_TW
dc.contributor.author (Authors)	Wu, Guan-Hui	en_US
dc.creator (Author)	吳冠輝	zh_TW
dc.creator (Author)	Wu, Guan-Hui	en_US
dc.date (Date)	2019	en_US
dc.date.accessioned	1-Jul-2019 10:43:44 (UTC+8)	-
dc.date.available	1-Jul-2019 10:43:44 (UTC+8)	-
dc.date.issued (Upload time)	1-Jul-2019 10:43:44 (UTC+8)	-
dc.identifier (Other Identifiers)	G0106354027	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/124122	-
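dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Statistics (統計學系)	zh_TW
dc.description (Description)	106354027	zh_TW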
dc.description.tableofcontents	Chapter 1: Introduction; Chapter 2: Methods; Chapter 3: Implementation; Chapter 4: Conclusion; References; Appendix 1; Appendix 2; Appendix 3	zh_TW
dc.format.extent	2540097 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0106354027	en_US
dc.subject (Keywords)	Chinese word segmentation (中文斷詞)	zh_TW
dc.subject (Keywords)	Text mining (文本探勘)	zh_TW
dc.subject (Keywords)	Dynamic programming (動態規劃法)	zh_TW
dc.subject (Keywords)	Word dictionary model (文字詞典模型)	zh_TW
dc.subject (Keywords)	EM algorithm (EM演算法)	zh_TW
dc.subject (Keywords)	Sequential relationship between words (詞彙序列關係)	zh_TW
dc.subject (Keywords)	Chinese text mining	en_US
dc.subject (Keywords)	Dynamic programming	en_US
dc.subject (Keywords)	EM algorithm	en_US
dc.subject (Keywords)	Word dictionary model	en_US
dc.subject (Keywords)	Word dependency	en_US
dc.subject (Keywords)	Word segmentation	en_US
dc.title (Title)	基於兩詞彙的序列關係建造非監督式 SeqWORDS 斷詞方法	zh_TW
dc.title (Title)	SeqWORDS: an unsupervised Chinese segmentation method using relationship of two consecutive words.	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU201900115	en_US
