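Title Applications of Parallel Corpora for Chinese Segmentation (應用平行語料建構中文斷詞組件)
Author Wang, Jui Ping (王瑞平)
Contributors Liu, Chao Lin (劉昭麟)
Wang, Jui Ping (王瑞平)
Keywords Chinese word segmentation
Chinese-English parallel corpora
unknown words
overlapping ambiguity
Date 2011
Uploaded 30-Oct-2012 11:46:02 (UTC+8)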
Abstract In this thesis, we build a Chinese word segmentation system based on Chinese-English parallel corpora and use it to segment corpora from different domains. Given Chinese-English parallel corpora from a target domain, the system automatically produces training data of reasonable quality, which is then used to train a segmentation model. This saves the time and manpower that manual segmentation would otherwise require to obtain training data.
To produce the training data, the system first looks up a Chinese dictionary to generate the candidate segmentations of every Chinese sentence in the parallel corpus. It then uses the Chinese translations of the words in the parallel English sentence to resolve overlapping ambiguity and discard incorrect candidates. In addition, we extract new Chinese-English translation pairs and unknown words from the parallel corpus and add them to the English-Chinese dictionary module and the Chinese dictionary module, respectively, to improve segmentation performance.
We evaluate segmentation performance in two experiments, using experimental data from three domains. In the first experiment, manually segmented Chinese sentences serve as test data. In the second, segmentation performance is assessed indirectly through the quality of Chinese-to-English translation. The results show that our system achieves acceptable segmentation performance.
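The training-data step described in the abstract (enumerate dictionary-based candidate segmentations, then use translations of the parallel English words to discard wrong ones) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy lexicon, the toy English-Chinese dictionary, and the coverage-based score are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def candidate_segmentations(sentence: str, lexicon: Set[str],
                            max_len: int = 4) -> List[List[str]]:
    """Enumerate every split of `sentence` into lexicon words.

    Single characters are always allowed, so at least one split exists.
    """
    if not sentence:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in lexicon:
            for rest in candidate_segmentations(sentence[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


def pick_by_translation(candidates: List[List[str]],
                        english_words: Iterable[str],
                        en_zh: Dict[str, List[str]]) -> List[str]:
    """Pick the candidate best supported by translations of the English words.

    Assumed score: number of characters covered by supported words,
    with fewer words as the tie-breaker.
    """
    supported = set()
    for w in english_words:
        supported.update(en_zh.get(w.lower(), ()))

    def score(seg: List[str]):
        return (sum(len(w) for w in seg if w in supported), -len(seg))

    return max(candidates, key=score)


# Toy data (hypothetical): "研究生命起源" has the overlapping ambiguity
# 研究/生命 versus 研究生/命.
lexicon = {"研究", "研究生", "生命", "起源"}
en_zh = {"study": ["研究", "學習"], "life": ["生命", "生活"], "origin": ["起源"]}

candidates = candidate_segmentations("研究生命起源", lexicon)
best = pick_by_translation(candidates, ["study", "the", "origin", "of", "life"], en_zh)
print(best)  # ['研究', '生命', '起源']
```

In this toy case the English side contains "study" and "life" but nothing that translates to 研究生 (graduate student), so the candidate 研究生/命/起源 receives less support and is discarded.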
Description Master's thesis
National Chengchi University
Department of Computer Science
99753016
100
Source http://thesis.lib.nccu.edu.tw/record/#G0099753016
Type thesis
dc.contributor.advisor 劉昭麟 zh_TW
dc.contributor.advisor Liu, Chao Lin en_US
dc.identifier (Other Identifiers) G0099753016 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/54795
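dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Method
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Related Work on Chinese Word Segmentation
2.1.1 Rule-Based Segmentation Methods
2.1.2 Statistical Segmentation Methods
2.1.3 Segmentation Ambiguity and Unknown Words
2.1.4 Inconsistent Segmentation Standards
2.2 Related Work on Segmentation Based on English-Chinese Bilingual Parallel Corpora
Chapter 3 System Architecture
3.1 System Workflow and Architecture
3.2 Segmentation Model Training Tools
Chapter 4 Dictionary Modules and the Addition of Near-Synonyms
4.1 Dictionary Modules
4.2 Building the Merged English-Chinese Dictionary with Near-Synonyms
4.2.1 Finding Near-Synonyms with 一詞泛讀
4.2.2 Finding Near-Synonyms with E-HowNet
4.2.3 Dictionary Construction Procedure
Chapter 5 Producing the Training Corpus
5.1 Generating Candidate Segmentations
5.2 Resolving Overlapping Ambiguity with English-Chinese Translation Information
5.3 Extracting Chinese-English Word Pairs and Unknown Words
5.3.1 Extracting "Candidate Chinese-English Leftover Word Pairs" and "Candidate Chinese Leftover Words"
5.3.2 Filtering by Likelihood Ratio and Co-occurrence Frequency
5.3.3 Filtering by Part-of-Speech Sequence Rules
Chapter 6 Experimental Results and Analysis
6.1 Sources of Experimental Corpora
6.2 Experiments on Extracting Chinese-English Word Pairs and Unknown Words
6.2.1 Extracting Chinese-English Word Pairs
6.2.2 Extracting Unknown Words
6.3 Evaluating Segmentation Performance with Manually Segmented Test Data
6.3.1 Experimental Design
6.3.2 Results and Analysis
6.4 Evaluating Segmentation Performance through Chinese-to-English Translation Quality
6.4.1 Experimental Design
6.4.2 Results and Analysis
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References
Appendix I Segmentation Performance on Corpora from Different Domains (in Word Counts)
Appendix II Record of Questions and Suggestions from the Oral Defense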
dc.language.iso en_US