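Title: 基於語境特徵及分群模型之中文多義詞消歧
Using contextual information in clustering Chinese word senses
Author: 周子皓
Chou, Tzu Hao
Contributors: 劉昭麟 (Liu, Chao Lin), advisor
賴惠玲 (Lai, Huei Lling), advisor
周子皓 (Chou, Tzu Hao), author
Keywords: Lexical ambiguity (多義詞)
Polysemy (一詞多義)
Homonym (同形異義)
Clustering (分群)
Word vector (詞向量)
Sentence vector (句向量)
Date: 2019
Upload time: 3-Oct-2019 17:17:45 (UTC+8)
Abstract (translated from the Chinese): Lexical ambiguity is common in language. In English, "bank" can mean either a financial institution or a riverbank, and "bass" can mean either a fish or an electric bass guitar; in Chinese, 黃牛 can mean either an ordinary ox or an illegal broker (a scalper). At present, the senses of an ambiguous word are learned mainly from dictionaries and retrieval systems, and both often fall short. Dictionaries generally record standardized usage and cannot be updated continuously, so they do not necessarily cover a word's newer senses or its more colloquial uses. Retrieval systems, such as the Academia Sinica Balanced Corpus retrieval system, return the sentences that contain the target word, but users must read all of those sentences before they know which senses the word takes in the corpus. Likewise, in current research on ambiguous words, humanities scholars must inspect the retrieved sentences one by one and judge them manually before the sentences can be grouped by sense.
In this study, the user supplies a small number of reference sentences, and the best clustering model and parameter settings are selected according to the purity value. The selected model is then used to find more sentences in the corpus that share a sense with the reference sentences, and the sentences are clustered by the sense of the target word, which reduces the time humanities scholars spend reading sentences one by one. To observe whether the type of ambiguity changes the clustering and embedding results, the study takes 亞馬遜, 蘋果, 小米, 火箭, and 東西 as its homonym targets and 出入, 出發, 壓力, 溫暖, and 東西 as its polysemy targets.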
Abstract (English): Lexical ambiguity is a common phenomenon in language. In English, the word "bank" can refer to the bank where we save our money or to a river bank. In Chinese, the term 黃牛 (literally "cattle") can stand for either cattle or a scalper. Currently, ambiguous terms are understood through either a dictionary or a search system, but both are often insufficient. Dictionaries follow a standard procedure for including content, and once published they cannot be updated frequently, so they can fail to include new senses or colloquial usage. With search systems, taking the Academia Sinica database as an example, users must read through all related sentences to understand the senses involved. Current research on lexical ambiguity requires researchers to examine sentences, extract term meanings, and cluster them one by one.
In this study, the best clustering model and parameters are selected based on purity values derived from reference sentences provided by the user. The selected model is then used to find more sentences in the database whose meanings match those of the references. Finally, the sentences are clustered according to the selected meanings. This study also observes whether different types of lexical ambiguity affect the results of clustering and embedding. It therefore takes homonyms such as "amazon" and "apple" and polysemous words such as "departure" and "pressure" as research subjects. The study aims to reduce the time researchers need to examine sentences, extract term meanings, and cluster them one by one in lexical ambiguity research.
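The purity-based selection mentioned in the abstract can be illustrated with a short sketch. It is not the thesis's implementation: the function names `purity` and `select_best`, the sense labels, and the two candidate settings are hypothetical, and purity is computed with the standard definition (each cluster is credited with the count of its most frequent reference label, and the credited counts are summed and divided by the number of labelled sentences).

```python
from collections import Counter, defaultdict


def purity(cluster_ids, sense_labels):
    """Standard clustering purity over labelled reference sentences."""
    if len(cluster_ids) != len(sense_labels):
        raise ValueError("cluster_ids and sense_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one reference sentence is required")
    # Group the reference sense labels by the cluster each sentence fell into.
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, sense_labels):
        by_cluster[cid].append(label)
    # Credit each cluster with the count of its most frequent sense.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(sense_labels)


def select_best(candidates, sense_labels):
    """Return (name, purity) of the candidate clustering with the highest purity."""
    scored = [(name, purity(ids, sense_labels)) for name, ids in candidates.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference sentences for 蘋果: three "fruit", three "company".
senses = ["fruit", "fruit", "fruit", "company", "company", "company"]
# Hypothetical cluster assignments produced by two parameter settings.
candidates = {
    "setting_a": [0, 0, 1, 1, 1, 1],  # one fruit sentence lands in the wrong cluster
    "setting_b": [0, 0, 0, 1, 1, 1],  # clusters match the senses exactly
}
best_name, best_purity = select_best(candidates, senses)
print(best_name, best_purity)  # prints: setting_b 1.0
```

Purity rises toward 1.0 as the number of clusters grows (one sentence per cluster is trivially pure), so a comparison of this kind is most meaningful between settings that use the same number of clusters.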
Description: Master's degree
National Chengchi University (國立政治大學)
Department of Computer Science
104753029
Source: http://thesis.lib.nccu.edu.tw/record/#G0104753029
Type: thesis
Identifier: G0104753029
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/126580
Table of contents:
Chapter 1 Introduction
  1.1 Research motivation
  1.2 Research objectives
  1.3 Main contributions
  1.4 Thesis organization
Chapter 2 Related literature and methods
  2.1 Lexical ambiguity
  2.2 Related work on experimental methods
    2.2.1 Embedding techniques
    2.2.2 Clustering techniques
  2.3 Evaluation methods
Chapter 3 Methodology
  3.1 System workflow
  3.2 Experimental corpora
    3.2.1 Wikipedia content
    3.2.2 News corpus
    3.2.3 Reference sentences
  3.3 Preprocessing of Wikipedia content
    3.3.1 Extracting text with WikiExtractor
    3.3.2 Conversion between Simplified and Traditional Chinese
    3.3.3 Removing leftover tags and whitespace
    3.3.4 Sentence segmentation
    3.3.5 Basic statistics of the Wikipedia content
    3.3.6 Word segmentation
    3.3.7 Comparison of basic statistics across word segmenters
  3.4 Extracting relevant sentences
    3.4.1 Adding context to relevant sentences
  3.5 Building the K-means clustering model
  3.6 Evaluating the clustering model
  3.7 Extracting representative sentences and evaluating extraction
Chapter 4 Experimental design and analysis of results
  4.1 Target words
    4.1.1 Number of relevant sentences per target word
    4.1.2 Discussion of Chinese Wikipedia
    4.1.3 Target words used in the experiments
    4.1.4 Sense proportions of the target words
  4.2 Building the clustering models
  4.3 Evaluating the clustering models
    4.3.1 亞馬遜
    4.3.2 出入
    4.3.3 蘋果
    4.3.4 出發
    4.3.5 壓力
    4.3.6 溫暖
    4.3.7 小米
    4.3.8 東西
    4.3.9 火箭
  4.4 Evaluating extraction performance
    4.4.1 亞馬遜
    4.4.2 出入
    4.4.3 蘋果
    4.4.4 出發
    4.4.5 壓力
    4.4.6 溫暖
    4.4.7 小米
    4.4.8 東西
    4.4.9 火箭
  4.5 Extracting representative sentences
    4.5.1 亞馬遜
    4.5.2 出入
    4.5.3 蘋果
    4.5.4 出發
    4.5.5 壓力
    4.5.6 溫暖
    4.5.7 小米
    4.5.8 東西
    4.5.9 火箭
  4.6 Overall comparison
Chapter 5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work
References
Appendix 1 Discussion from the thesis oral defense
Appendix 2 Experiments related to the thesis oral defense
Format: application/pdf, 5382562 bytes
DOI: 10.6814/NCCU201901187