Academic Output - Proceedings

Title Using parallel corpora to automatically generate training data for Chinese segmenters in NTCIR PatentMT task
Authors Wang, Jui-Ping; Liu, Chao-Lin (劉昭麟)
Contributor Department of Computer Science (資科系)
Keywords Chinese-English Patent Machine Translation, Chinese Near Synonyms, Chinese Segmentation, Machine Learning
Date 2013-06
Uploaded 22-Jun-2016 17:19:26 (UTC+8)
Abstract Unlike English and many other alphabetic languages, Chinese texts do not use spaces as word separators. To use Moses to train translation models, we must segment Chinese texts into sequences of Chinese words. In recent years, more and more software tools for Chinese segmentation have become available on the Internet. However, some of these tools were trained on general texts, so they might not handle domain-specific terms in patent documents very well. Some machine-learning-based tools require segmented Chinese texts to train segmentation models. In both cases, providing segmented Chinese texts, either to refine a pre-trained model or to create a new segmentation model, is an important basis for successful Chinese-English machine translation systems. Ideally, high-quality segmented texts should be created and verified by domain experts, but doing so would be quite costly. We explored an approach to algorithmically generate segmented texts with parallel texts and lexical resources. With the new approach, our scores in NTCIR-10 PatentMT indeed improved over our scores in NTCIR-9 PatentMT.
Relation Proceedings of NTCIR-10, 368-372, Tokyo, Japan, 18-21 June 2013
Type conference
URI http://nccur.lib.nccu.edu.tw/handle/140.119/98245
Format application/pdf (302675 bytes)
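
The core idea in the abstract, deriving segmented Chinese training data from parallel texts and lexical resources, can be illustrated with a minimal sketch. The Python function below is a hypothetical simplification, not the authors' actual algorithm: it greedily segments a Chinese sentence by longest match against a bilingual lexicon, preferring lexicon entries whose English translation also appears in the parallel English sentence. The function name, toy lexicon, and example sentences are all invented for illustration.

# Minimal sketch, assuming a simple Chinese-to-English lexicon and a naive
# substring check against the parallel English sentence. Not the paper's method.
from typing import Dict, List

def segment_with_parallel_text(
    zh: str,
    en: str,
    lexicon: Dict[str, str],  # Chinese word -> English translation
    max_len: int = 5,         # longest lexicon entry to consider
) -> List[str]:
    """Greedy longest-match segmentation guided by a parallel English sentence."""
    en_lower = en.lower()
    tokens: List[str] = []
    i = 0
    while i < len(zh):
        match = None
        # Try the longest candidate first. A lexicon word whose English
        # translation occurs in the parallel sentence is accepted outright;
        # otherwise remember the longest plain lexicon hit as a fallback.
        for n in range(min(max_len, len(zh) - i), 0, -1):
            cand = zh[i:i + n]
            if cand in lexicon and lexicon[cand].lower() in en_lower:
                match = cand
                break
            if match is None and cand in lexicon:
                match = cand
        if match is None:
            match = zh[i]  # unknown character: emit as a single-character token
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical usage with a toy patent-style lexicon.
lexicon = {"半導體": "semiconductor", "裝置": "device", "包含": "comprises"}
zh = "半導體裝置包含一基板"
en = "The semiconductor device comprises a substrate."
print(" ".join(segment_with_parallel_text(zh, en, lexicon)))
# -> 半導體 裝置 包含 一 基 板  (characters not in the lexicon fall back to singles)

The output of such a procedure, run over a large Chinese-English parallel patent corpus, could then serve as training data for a machine-learning-based segmenter, which is the general strategy the abstract describes.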