dc.contributor | Department of Computer Science | |
dc.creator (Author) | Wang, J.-P.; Liu, Chao-Lin | |
dc.creator (Author) | 劉昭麟 | zh_TW |
dc.date (Date) | 2012 | |
dc.date.accessioned | 10-Apr-2015 16:38:30 (UTC+8) | - |
dc.date.available | 10-Apr-2015 16:38:30 (UTC+8) | - |
dc.date.issued (Upload time) | 10-Apr-2015 16:38:30 (UTC+8) | - |
dc.identifier.uri (URI) | http://nccur.lib.nccu.edu.tw/handle/140.119/74480 | - |
dc.description.abstract (Abstract) | Instead of directly providing the service of Chinese segmentation, some open-source software allows us to train segmentation models with segmented text. The resulting models can perform quite well, if training data of high quality are available. In reality, it is not easy to obtain sufficient and excellent training data, unfortunately. We report an exploration of using parallel corpora and various lexicons with techniques of identifying unknown words and near synonyms to automatically generate training data for such open-source software. We achieved promising results of segmentation in current experiments. Although the results fell short of outperforming the well-known Chinese segmenters, we believe that the proposed approach offers a viable alternative for users of the open-source software to generate their own training data. | |
dc.format.extent | 176 bytes | - |
dc.format.mimetype | text/html | - |
dc.relation (Relation) | Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012 | |
dc.subject (Keywords) | Chinese segmentation; High quality; Open-source software; Parallel corpora; Segmentation models; Training data; Computational linguistics; Speech processing | |
dc.title (Title) | Applications of parallel corpora for Chinese segmentation | |
dc.type (Resource type) | conference | en |