dc.contributor.advisor | 劉吉軒 | zh_TW |
dc.contributor.advisor | Liu, Jyi Shane | en_US |
dc.contributor.author (Authors) | 黃群弼 | zh_TW |
dc.creator (作者) | 黃群弼 | zh_TW |
dc.date (日期) | 2008 | en_US |
dc.date.accessioned | 19-Sep-2009 12:10:04 (UTC+8) | - |
dc.date.available | 19-Sep-2009 12:10:04 (UTC+8) | - |
dc.date.issued (上傳時間) | 19-Sep-2009 12:10:04 (UTC+8) | - |
dc.identifier (Other Identifiers) | G0094971010 | en_US |
dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/37106 | - |
dc.description (描述) | Master's | zh_TW |
dc.description (描述) | National Chengchi University | zh_TW |
dc.description (描述) | Department of Computer Science | zh_TW |
dc.description (描述) | 94971010 | zh_TW |
dc.description (描述) | 97 | zh_TW |
dc.description.abstract (摘要) | 中文繁簡在字體或電腦編碼上明顯不同之外,在部份詞彙的用法也有不同,這些用法不同的詞彙卻有相同意義的詞彙稱為繁簡體中的等義詞,這些等義詞在雙方文化交流時可能會造成一些障礙,例如人們互相對話、文件書籍翻譯或軟體系統等轉換時容易造成詞義上的誤解,目前均以人工方式來解決不同詞彙的問題,均會費時耗力且易疏漏,若能利用科學的方法讓電腦能自動辨識中文繁簡的等義詞,便能利用辨識出的等義詞給予提示,解決繁簡詞義不同所造成的誤解。依照實驗設計架構,首先建立電腦類與一般類的繁簡體語料庫,作為辨識的基礎,並建立研究的架構與方法,分為二個階段三種方法,第一階段使用第一種方法,我們先使用N-gram辨識等義詞,評估單一方法是否能有效辨識出等義詞,第二階段使用第二種方法PMI-IR & LC-IR方法與第三種方法Context Vector,評估第二階段的方法是否能將等義詞的辨識能力提高。根據本研究目的,讓電腦能自動在語料庫中自動辨識中文繁簡等義詞,所以提出了新的辨識架構,用N-gram初步辨識出等義詞,並經由PMI-IR & LC-IR與Context Vector方法提高Precision約0~20%不等。本研究結論是採用不同語言的語料庫,使用N-gram能夠辦識出等義詞,並搭配PMI-IR & LC-IR與Context Vector方法後,可以強化與提昇其等義詞辨識的能力,解決單一方法等義詞辨識能力不足之問題。 | zh_TW |
dc.description.abstract (摘要) | Traditional Chinese and Simplified Chinese differ not only in glyphs and character encoding but also in the usage of some vocabulary. Words that differ in form between the two but carry the same meaning are called exact synonyms. These synonyms can hinder cultural exchange and cause misunderstandings in conversation, in the translation of documents and books, and in the conversion of software systems. At present such vocabulary differences are resolved manually, which is time-consuming, laborious, and error-prone. If a computer could automatically recognize the exact synonyms between Traditional and Simplified Chinese, the recognized synonyms could be used as hints to resolve these misunderstandings. Following the experimental design, we first build Traditional and Simplified Chinese corpora in a computer domain and in a general domain as the basis for recognition, and then establish a research framework consisting of two stages and three methods. The first stage uses the first method, N-gram, to recognize synonyms and evaluates whether a single method can recognize them effectively. The second stage uses the second method, PMI-IR & LC-IR, and the third method, Context Vector, and evaluates whether they improve synonym recognition. To let a computer recognize exact synonyms between Traditional and Simplified Chinese automatically in a corpus, this study proposes a new recognition framework: N-gram performs a preliminary recognition of synonyms, and the PMI-IR & LC-IR and Context Vector methods then improve precision by roughly 0 to 20%.
The study concludes that, using corpora in different languages, N-gram can recognize exact synonyms, and that combining it with the PMI-IR & LC-IR and Context Vector methods strengthens recognition and overcomes the limited ability of a single method. | en_US |
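dc.description.tableofcontents | Chapter 1 Introduction: 1.1 Overview; 1.2 Research background and motivation; 1.3 Research method; 1.4 Contributions of this thesis; 1.5 Research scope and limitations; 1.6 Thesis structure. Chapter 2 Literature review: 2.1 Related work on exact synonym recognition; 2.1.1 Absolute synonyms and relative synonyms; 2.1.2 Word sense identification algorithms; 2.1.3 Chinese word sense identification techniques; 2.2 Term co-occurrence; 2.3 N-gram; 2.4 The PMI-IR & LC-IR methods; 2.4.1 PMI-IR (Pointwise Mutual Information-Information Retrieval); 2.4.2 LC-IR (Local Context-Information Retrieval); 2.5 Context Vector (vector space model); 2.6 Summary. Chapter 3 Methods for recognizing Traditional-Simplified exact synonyms: 3.1 Research framework; 3.2 Building the corpus module; 3.2.1 Building the computer-domain Traditional and Simplified corpora; 3.2.2 Building the general-domain Traditional and Simplified corpora; 3.2.3 Building the set of correct word pairs; 3.2.4 Building noise data; 3.2.5 Stop words; 3.2.6 Chinese internal codes; 3.2.7 Converting between Traditional and Simplified encodings; 3.3 Word segmentation; 3.3.1 Segmenting Traditional Chinese; 3.3.2 Segmenting Simplified Chinese; 3.3.3 Handling punctuation; 3.4 Building the N-gram module; 3.5 Building the PMI-IR & LC-IR module; 3.6 Building the Context Vector module; 3.7 Summary. Chapter 4 Experimental design and analysis: 4.1 Sources of the experimental corpora; 4.2 Experimental design; 4.2.1 Segmenting the corpora; 4.2.2 Processing the segmentation results with N-gram; 4.2.3 Filtering synonym candidates; 4.2.4 Second-pass filtering with PMI-IR & LC-IR; 4.2.5 Second-pass filtering with Context Vector; 4.3 Evaluation method; 4.4 Experimental analysis; 4.5 Summary. Chapter 5 Conclusions and future directions: 5.1 Conclusions; 5.2 Suggestions for future research; 5.3 Future research directions. Chapter 6 References. | zh_TW |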
dc.format.extent | 109225 bytes | - |
dc.format.extent | 135156 bytes | - |
dc.format.extent | 134939 bytes | - |
dc.format.extent | 155430 bytes | - |
dc.format.extent | 220231 bytes | - |
dc.format.extent | 380862 bytes | - |
dc.format.extent | 473214 bytes | - |
dc.format.extent | 886296 bytes | - |
dc.format.extent | 174851 bytes | - |
dc.format.extent | 158195 bytes | - |
dc.format.extent | 1065193 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en_US | - |
dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0094971010 | en_US |
dc.subject (關鍵詞) | Traditional-Simplified Chinese correspondence | zh_TW |
dc.subject (關鍵詞) | Exact synonyms | zh_TW |
dc.subject (關鍵詞) | Automatic recognition | zh_TW |
dc.title (題名) | 中文繁簡等義詞自動辨識之研究 | zh_TW |
dc.title (題名) | A Study on Automatic Recognition on Exact Synonyms between Traditional and Simplified Chinese | en_US |
dc.type (資料類型) | thesis | en |
dc.relation.reference (參考文獻) | 1. Amruta Purandare, & Ted Pedersen. (2004). Improving Word Sense Discrimination with Gloss Augmented Feature Vectors. Appears in the Proceedings of the Workshop on Lexical Resources for the Web and Word Sense Disambiguation. Puebla Mexico. | zh_TW |
dc.relation.reference (參考文獻) | 2. Attar, R., & Fraenkel, A. S. (1977). Local Feedback in Full-Text Retrieval Systems. Journal of the ACM, Volume 24, Issue 3, (pp. 397-417). | zh_TW |
dc.relation.reference (參考文獻) | 3. Ben, Gabriel, & David. (2006). Dimensionality Reduction Aids Term Co-occurrence Based Multi-Document Summarization. | zh_TW |
dc.relation.reference (參考文獻) | 4. Brown, & Peter. (1991). Word sense disambiguation using statistical methods. In ACL 29, (pp. 264-270). | zh_TW |
dc.relation.reference (參考文獻) | 5. C. J. Van Rijsbergen. (1979). Information Retrieval. Butterworths, 2nd edition, (pp. 208). | zh_TW |
dc.relation.reference (參考文獻) | 6. Chen, Jen-Nan, & Chang, Jason-S. (1998). TopSense: A Topical Sense Clustering Method based on Information Retrieval Techniques on Machine Readable Resources. Special Issue on Word Sense Disambiguation, Computational Linguistics, (pp. 61-95). | zh_TW |
dc.relation.reference (參考文獻) | 7. Chen, Keh-Jiann, & You, Jia-Ming. (2002). A Study on Word Similarity using Context Vector Models. | zh_TW |
dc.relation.reference (參考文獻) | 8. Chen, Keh-Jiann, & You, Jia-Ming. (2006). Improving Context Vector Models by Feature Clustering for Automatic Thesaurus Construction. Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. | zh_TW |
dc.relation.reference (參考文獻) | 9. David Hull. (1994). Improving Text Retrieval for the Routing Problem using Latent Semantic Indexing. ACM SIGIR Conference. | zh_TW |
dc.relation.reference (參考文獻) | 10. David Yarowsky. (1994). Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, NM, (pp. 88-95). | zh_TW |
dc.relation.reference (參考文獻) | 11. Daniel Jurafsky, & James H. Martin. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. Prentice-Hall. | zh_TW |
dc.relation.reference (參考文獻) | 12. Dan Klein, & Christopher D. Manning. (2003). Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, (pp. 423-430). | zh_TW |
dc.relation.reference (參考文獻) | 13. Derrick Higgins. (2004). Which statistics reflect semantics? Rethinking synonymy and word similarity. | zh_TW |
dc.relation.reference (參考文獻) | 14. Dong, Zhen-dong, & Dong, Qiang. (2006). Hownet and the Computation of Meaning. World Scientific. | zh_TW |
dc.relation.reference (參考文獻) | 15. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., (pp. 1-26). | zh_TW |
dc.relation.reference (參考文獻) | 16. G. Salton & MJ McGill. (1983). Introduction to modern information retrieval. New York: McGraw-Hill. | zh_TW |
dc.relation.reference (參考文獻) | 17. GAISWWW Query. Retrieved from http://gais.cs.ccu.edu.tw/ | zh_TW |
dc.relation.reference (參考文獻) | 18. Gale, William, Church, Kenneth, & Yarowsky, David. (1992). A method of disambiguating word senses in a large corpus. Computers and the Humanities 26, (pp. 415-439). | zh_TW |
dc.relation.reference (參考文獻) | 19. Google Offers Immediate Access to 3 Billion Web Documents. (2001). Retrieved from Google Inc: http://www.google.com/press/pressrel/3billion.html | zh_TW |
dc.relation.reference (參考文獻) | 20. H. Edmund Stiles. (1961). The association factor in information retrieval. Journal of the ACM, 8, (pp. 271-279). | zh_TW |
dc.relation.reference (參考文獻) | 21. Helen J. Peat, & Peter Willett. (1991). The Limitations of Term Co-occurrence Data for Query Expansion in Document Retrieval Systems. | zh_TW |
dc.relation.reference (參考文獻) | 22. Howard D. White, Xia Lin, Jan W. Buzydlowski, & Chaomei Chen. (2001). Term Co-occurrence Analysis as an Interface for Digital Libraries. | zh_TW |
dc.relation.reference (參考文獻) | 23. Jarmasz, M., & Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. University of Ottawa ms. | zh_TW |
dc.relation.reference (參考文獻) | 24. Joe A. Guthrie, Louise Guthrie, Yorick Wilks, & Homa Aidinejad. (1991). Subject-Dependent Co-occurrence and Word Sense Disambiguation. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, (pp. 146-152). | zh_TW |
dc.relation.reference (參考文獻) | 25. Le, Cuong-Anh, & Shimizu, Akira. (2004). High WSD Accuracy Using Naive Bayesian Classifier with Rich Features. PACLIC 18. Tokyo. | zh_TW |
dc.relation.reference (參考文獻) | 26. Lesk, M. E. (1969). Word-word associations in document retrieval systems. American Documentation, 20, (pp. 27-38). | zh_TW |
dc.relation.reference (參考文獻) | 27. Li, Xiaobin, Stan Szpakowicz, & Matwin. (1995). A WordNet-Based Algorithm for Word Semantic Sense Disambiguation. In Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI-95. Montreal, Canada. | zh_TW |
dc.relation.reference (參考文獻) | 28. Lin, De-kang. (1997). Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL-97. Madrid, Spain. | zh_TW |
dc.relation.reference (參考文獻) | 29. Lu, Wen-Hsiang, Lee, Hsi-Jian, & Chien, Lee-Feng. (2003). Term Translation Extraction Using Web Mining Techniques. | zh_TW |
dc.relation.reference (參考文獻) | 30. Magnus Sahlgren. (2006). Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. | zh_TW |
dc.relation.reference (參考文獻) | 31. Manning, Christopher, & Schütze, Hinrich. (1999). Foundations of Statistical Natural Language Processing. MIT Press. | zh_TW |
dc.relation.reference (參考文獻) | 32. Marco Baroni, & Sabrina Bisi. (2004). Using cooccurrence statistics and the web to discover synonyms in a technical language. | zh_TW |
dc.relation.reference (參考文獻) | 33. María Ruiz-Casado, Enrique Alfonseca, & Pablo Castells. (2005). Using context-window overlapping in synonym discovery and ontology extension. | zh_TW |
dc.relation.reference (參考文獻) | 34. M. E. Maron, & J. L. Kuhns. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7, (pp. 216-244). | zh_TW |
dc.relation.reference (參考文獻) | 35. Michael.W. Berry, Susan.T. Dumais, & Amy.T. Shippy. (1995). A Case Study of Latent Semantic Indexing. Tech Rep., (pp. 95-271). | zh_TW |
dc.relation.reference (參考文獻) | 36. Michael Lesk. (1986). Automatic Sense Disambiguation: How to tell a pine cone from an ice cream cone. In Proceedings of the 1986 SIGDOC Conference, New York. Association for Computing Machinery, (pp. 24-26). | zh_TW |
dc.relation.reference (參考文獻) | 37. Siddharth Patwardhan, Satanjeev Banerjee, & Ted Pedersen. (2005). SenseRelate::TargetWord - A Generalized Framework for Word Sense Disambiguation. Appears in the Proceedings of the Twentieth National Conference on Artificial Intelligence. Pittsburgh, PA. | zh_TW |
dc.relation.reference (參考文獻) | 38. Peng, Fu-chun, Huang, Xiang-ji, Schuurmans, Dale, & Wang, Shao-jun. (2003). Text Classification in Asian Languages without Word Segmentation. Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages (IRAL), Vol. 18, (pp. 41-48). | zh_TW |
dc.relation.reference (參考文獻) | 39. Philip Edmonds & Graeme Hirst. (2002). Near-synonymy and lexical choice. Computational Linguistics, 28(2), (pp. 105-144). | zh_TW |
dc.relation.reference (參考文獻) | 40. Q.yuhen word segmentation system. Retrieved from http://www.rainsts.net | zh_TW |
dc.relation.reference (參考文獻) | 41. Senseval-2. (2001). Retrieved from http://193.133.140.102/senseval2/ | zh_TW |
dc.relation.reference (參考文獻) | 42. Sketch Engine. Retrieved from http://www.sketchengine.co.uk/ | zh_TW |
dc.relation.reference (參考文獻) | 43. Slator, B. (1991). Using Context for Sense Preference. In Zernik (ed.) Lexical Acquisition: Exploiting on-line Resources to Build a Lexicon, Lawrence Erlbaum, Hillsdale. | zh_TW |
dc.relation.reference (參考文獻) | 44. Soumen Chakrabarti, Martin van den Berg, & Byron Dom. (1999). Focused crawling: A new approach to Topic-Specific Web Resource Discovery. Proceedings of the WWW8 Conference. | zh_TW |
dc.relation.reference (參考文獻) | 45. Stanford Parser. Retrieved from http://www-nlp.stanford.edu/downloads/lex-parser.shtml | zh_TW |
dc.relation.reference (參考文獻) | 46. Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (1965). Statistical association methods for mechanized documentation. Washington:National Bureau of Standards (Occasional Publication no. 269). | zh_TW |
dc.relation.reference (參考文獻) | 47. Thomas K. Landauer, & Susan T. Dumais. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), (pp. 211–240). | zh_TW |
dc.relation.reference (參考文獻) | 48. Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML2001), (pp. 491-502). Freiburg, Germany. | zh_TW |
dc.relation.reference (參考文獻) | 49. Ungerer, F., & Schmid, H.-J. (1996). An Introduction to Cognitive Linguistics. London: Longman. | zh_TW |
dc.relation.reference (參考文獻) | 50. Walker. (1987). Thesaurus-Based Disambiguation. | zh_TW |
dc.relation.reference (參考文獻) | 51. Wang, Jenq-Haur, Teng, Jei-Wen, Cheng, Pu-Jen, Lu, Wen-Hsiang, & Chien, Lee-Feng (2004). Translating Unknown Cross-Lingual Queries in Digital Libraries Using a Web-based Approach. | zh_TW |
dc.relation.reference (參考文獻) | 52. William C. Hannas. (1997). Asia's Orthographic Dilemma. University of Hawaii Press. | zh_TW |
dc.relation.reference (參考文獻) | 53. William, R. Caid, & Joel, L. Carleton. (2003). Context Vector-Based Text Retrieval. A Fair Isaac White Paper. | zh_TW |
dc.relation.reference (參考文獻) | 54. Yang, Chang-hua, & Sue, Jin-Ker. (2002). Considerations of Linking WordNet with MRD. In Proceedings of the 19th International Conference on Computational Linguistics, (pp. 1121-1127). | zh_TW |
dc.relation.reference (參考文獻) | 55. Academia Sinica Chinese word segmentation system. Retrieved from http://rocling.iis.sinica.edu.tw/CKIP/wordsegment.htm | zh_TW |
dc.relation.reference (參考文獻) | 56. 中国知网. Retrieved from http://www.cnki.net/index.htm | zh_TW |
dc.relation.reference (參考文獻) | 57. 北京大學语言信息处理研究所. Retrieved from http://202.112.195.8/Down.asp | zh_TW |
dc.relation.reference (參考文獻) | 58. 全昌勤、何婷婷、姬東鴻與劉輝. (2005). 從搭配知識獲取最優種子的詞義消歧方法. 中文信息學報, Vol. 19, No. 1, (pp. 30-37). | zh_TW |
dc.relation.reference (參考文獻) | 59. 朱邦復工作室. 中台港澳通用中文內碼之介紹. Retrieved from http://www.cbflabs.com/tec/cbflabs/jason2k0914.htm | zh_TW |
dc.relation.reference (參考文獻) | 60. 車方翔、劉挺、秦兵與李生. (2003). 面向依存文法分析的搭配抽取方法研究. 哈爾濱工業大學信息檢索研究室論文集. | zh_TW |
dc.relation.reference (參考文獻) | 61. 知网. Retrieved from http://www.keenage.com/ | zh_TW |
dc.relation.reference (參考文獻) | 62. 俞士汶、朱學峰、王惠與張芸芸. (1998). 現代漢語語法信息辭典. 清華大學出版社. | zh_TW |
dc.relation.reference (參考文獻) | 63. 倚天. 倚天中文系統技術手冊. | zh_TW |
dc.relation.reference (參考文獻) | 64. 梅家駒、竺一鳴、高蘊琦與殷鴻翔. (1993). 同義詞詞林. 上海辭書出版社. | zh_TW |
dc.relation.reference (參考文獻) | 65. 搜狗实验室(Sogou Labs). Retrieved from http://www.sogou.com/labs/ | zh_TW |
dc.relation.reference (參考文獻) | 66. 維基百科. Retrieved from http://zh.wikipedia.org | zh_TW |
dc.relation.reference (參考文獻) | 67. 汤志祥. (2002). 汉语词汇的"借用"和"移用"及其深层社会意义. | zh_TW |
dc.relation.reference (參考文獻) | 68. 陈水仙. (2006). 港台地区词汇对普通话的影响. 广东外语外贸大学英语教育学院. | zh_TW |
dc.relation.reference (參考文獻) | 69. 陈钟、彭波、关宏飞與王继民. (2005). 一种词汇共现算法及共现词对检索系统排序的影响. | zh_TW |
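
The abstract in this record describes a second stage that re-scores N-gram candidates with a mutual-information measure and with context vectors. The following is a minimal sketch of those two scoring ideas, not the thesis's implementation. The function names, the window size, and the toy sentences are all hypothetical. It uses the generic pointwise mutual information formula rather than the exact PMI-IR or LC-IR scores, and it assumes both corpora have already been segmented and converted to the same script so that context words can be compared.

```python
import math
from collections import Counter


def pmi(cooccur: int, count_a: int, count_b: int, total: int) -> float:
    """Generic pointwise mutual information, log2(p(a,b) / (p(a) * p(b))),
    computed from raw counts over `total` observation units."""
    if min(cooccur, count_a, count_b) <= 0:
        return float("-inf")  # no evidence of association
    return math.log2((cooccur * total) / (count_a * count_b))


def context_vector(tokens, target, window=2):
    """Count the words that occur within `window` positions of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Skip position i itself so the target is not its own context.
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy, hand-segmented sentences (hypothetical data, same script on both sides).
taiwan_tokens = ["安裝", "軟體", "到", "硬碟"]
mainland_tokens = ["安裝", "軟件", "到", "硬盤"]

similarity = cosine(
    context_vector(taiwan_tokens, "軟體", window=1),
    context_vector(mainland_tokens, "軟件", window=1),
)
print(similarity)
```

In this toy case the two candidate words share identical neighbours, so the similarity is close to 1. On real corpora a threshold on such scores would have to be tuned against known word pairs.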