Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties | NCCU Academic Hub

Title	Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties
Authors	劉昭麟 Liu, Chao-Lin; Zheng, Ti-Yong; Chen, Kuan-Chun; Chung, Meng-Han
Contributor	Department of Computer Science (資訊系)
Date	2022-11
Upload time	30-Nov-2023 11:26:40 (UTC+8)
Abstract	Classical Chinese poems of the Tang and Song dynasties are an important part of the study of Chinese literature. To understand the poems thoroughly, segmenting the verses properly is an important step for both human readers and software agents. Yet, because of limited data availability and the cost of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with a background in Chinese literature labeled 32,399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies on the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentation of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems and achieved promising results at the line level as well. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely correctly under favorable conditions. Switching to deep-learning methods helped us achieve better than 30%.
Relation	Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Taipei Medical University (台北醫學大學); Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學會), pp. 135-144
Type	conference
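
To illustrate the pattern notation and the gap between line-level and whole-poem agreement described in the abstract, here is a minimal Python sketch. The segmentations of the sample quatrain, the two annotators, and the function names `pattern`, `line_agreement`, and `poem_agrees` are illustrative assumptions, not data or code from the paper.

```python
from typing import List

Line = List[str]   # one poem line as a list of word tokens
Poem = List[Line]  # a poem as a list of segmented lines


def pattern(line: Line) -> str:
    """Return the token-length pattern of a segmented line, e.g. '2-2-1'."""
    return "-".join(str(len(token)) for token in line)


def line_agreement(a: Poem, b: Poem) -> float:
    """Fraction of lines that two annotators segmented identically."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


def poem_agrees(a: Poem, b: Poem) -> bool:
    """True only if every line of the poem is segmented identically."""
    return a == b


# Hypothetical segmentations of a well-known Tang quatrain by two annotators.
annotator_1: Poem = [
    ["床前", "明月", "光"],
    ["疑是", "地上", "霜"],
    ["舉頭", "望", "明月"],
    ["低頭", "思", "故鄉"],
]
annotator_2: Poem = [
    ["床前", "明月", "光"],
    ["疑", "是", "地上", "霜"],  # differs from annotator_1 on this line only
    ["舉頭", "望", "明月"],
    ["低頭", "思", "故鄉"],
]

print([pattern(line) for line in annotator_1])  # ['2-2-1', '2-2-1', '2-1-2', '2-1-2']
print(line_agreement(annotator_1, annotator_2))  # 0.75
print(poem_agrees(annotator_1, annotator_2))     # False
```

A single differing line is enough to make the whole poem count as a disagreement, which is one way to see why whole-poem agreement can be far lower than line-level agreement.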

dc.contributor	Department of Computer Science (資訊系)
dc.creator (Author)	劉昭麟
dc.creator (Author)	Liu, Chao-Lin; Zheng, Ti-Yong; Chen, Kuan-Chun; Chung, Meng-Han
dc.date (Date)	2022-11
dc.date.accessioned	30-Nov-2023 11:26:40 (UTC+8)
dc.date.available	30-Nov-2023 11:26:40 (UTC+8)
dc.date.issued (Upload time)	30-Nov-2023 11:26:40 (UTC+8)
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/148302
dc.format.extent	105 bytes
dc.format.mimetype	text/html
dc.relation (Relation)	Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Taipei Medical University (台北醫學大學); Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學會), pp. 135-144
dc.title (Title)	Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties
dc.type (Type)	conference