Title Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties
Author 劉昭麟
Liu, Chao-Lin; Zheng, Ti-Yong; Chen, Kuan-Chun; Chung, Meng-Han
Contributor Department of Computer Science (資訊系)
Date 2022-11
Upload time 30-Nov-2023 11:26:40 (UTC+8)
Abstract Classical Chinese poems of the Tang and Song dynasties are an important part of the study of Chinese literature. To thoroughly understand the poems, properly segmenting the verses is an important step for human readers and software agents. Yet, due to limited data availability and the costs of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with a background in Chinese literature labeled 32,399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies of the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., that the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentation of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems, and achieved promising results at the line level as well. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely correctly under favorable conditions. Switching to deep-learning methods helped us achieve better than 30%.
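Relation Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Taipei Medical University (台北醫學大學), Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學會), pp. 135-144
Type conference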
dc.contributor Department of Computer Science (資訊系)
dc.creator (Author) 劉昭麟
dc.creator (Author) Liu, Chao-Lin; Zheng, Ti-Yong; Chen, Kuan-Chun; Chung, Meng-Han
dc.date (Date) 2022-11
dc.date.accessioned 30-Nov-2023 11:26:40 (UTC+8)
dc.date.available 30-Nov-2023 11:26:40 (UTC+8)
dc.date.issued (Upload time) 30-Nov-2023 11:26:40 (UTC+8)
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/148302
dc.description.abstract (Abstract) Classical Chinese poems of the Tang and Song dynasties are an important part of the study of Chinese literature. To thoroughly understand the poems, properly segmenting the verses is an important step for human readers and software agents. Yet, due to limited data availability and the costs of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with a background in Chinese literature labeled 32,399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies of the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., that the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentation of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems, and achieved promising results at the line level as well. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely correctly under favorable conditions. Switching to deep-learning methods helped us achieve better than 30%.
dc.format.extent 105 bytes
dc.format.mimetype text/html
dc.relation (Relation) Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Taipei Medical University (台北醫學大學), Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學會), pp. 135-144
dc.title (Title) Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties
dc.type (Type) conference