Title Applications of parallel corpora for Chinese segmentation
Authors Wang, J.-P.; Liu, Chao-Lin (劉昭麟)
Contributor Department of Computer Science (資科系)
Keywords Chinese segmentation; High quality; Open-source software; Parallel corpora; Segmentation models; Training data; Computational linguistics; Speech processing
Date 2012
Upload time 10-Apr-2015 16:38:30 (UTC+8)
Abstract Instead of directly providing a Chinese segmentation service, some open-source software allows users to train segmentation models with segmented text. The resulting models can perform quite well if high-quality training data are available. Unfortunately, it is not easy in practice to obtain sufficient training data of excellent quality. We report an exploration of using parallel corpora and various lexicons, together with techniques for identifying unknown words and near synonyms, to automatically generate training data for such open-source software. We achieved promising segmentation results in current experiments. Although the results fell short of outperforming the well-known Chinese segmenters, we believe that the proposed approach offers a viable alternative for users of the open-source software to generate their own training data.
Relation Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012
Type conference
URI http://nccur.lib.nccu.edu.tw/handle/140.119/74480
Format extent 176 bytes
MIME type text/html