Applications of parallel corpora for Chinese segmentation | NCCU Academic Hub

Academic Output - Proceedings

Title	Applications of parallel corpora for Chinese segmentation
Authors	Wang, J.-P.; Liu, Chao-Lin 劉昭麟
Contributor	Department of Computer Science
Keywords	Chinese segmentation; High quality; Open-source software; Parallel corpora; Segmentation models; Training data; Computational linguistics; Speech processing
Date	2012
Upload time	10-Apr-2015 16:38:30 (UTC+8)
Abstract	Rather than providing Chinese segmentation as a service directly, some open-source software lets users train segmentation models on segmented text. The resulting models can perform quite well if high-quality training data are available. In practice, however, sufficient high-quality training data are not easy to obtain. We report an exploration of using parallel corpora and various lexicons, together with techniques for identifying unknown words and near synonyms, to automatically generate training data for such open-source software. Our current experiments yielded promising segmentation results. Although the results fell short of outperforming well-known Chinese segmenters, we believe that the proposed approach offers a viable alternative for users of the open-source software to generate their own training data.
Relation	Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012
Type	conference

dc.contributor	Department of Computer Science
dc.creator (Author)	Wang, J.-P.; Liu, Chao-Lin
dc.creator (Author)	劉昭麟	zh_TW
dc.date (Date)	2012
dc.date.accessioned	10-Apr-2015 16:38:30 (UTC+8)	-
dc.date.available	10-Apr-2015 16:38:30 (UTC+8)	-
dc.date.issued (Upload time)	10-Apr-2015 16:38:30 (UTC+8)	-
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/74480	-
dc.format.extent	176 bytes	-
dc.format.mimetype	text/html	-
dc.relation (Relation)	Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012
dc.subject (Keywords)	Chinese segmentation; High quality; Open-source software; Parallel corpora; Segmentation models; Training data; Computational linguistics; Speech processing
dc.title (Title)	Applications of parallel corpora for Chinese segmentation
dc.type (Type)	conference	en