dc.contributor | Department of Computer Science | |
dc.creator (Author) | 劉昭麟 | |
dc.creator (Author) | Liu, Chao-Lin;Zheng, Ti-Yong;Chen, Kuan-Chun;Chung, Meng-Han | |
dc.date (Date) | 2022-11 | |
dc.date.accessioned | 30-Nov-2023 11:26:40 (UTC+8) | - |
dc.date.available | 30-Nov-2023 11:26:40 (UTC+8) | - |
dc.date.issued (Upload time) | 30-Nov-2023 11:26:40 (UTC+8) | - |
dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/148302 | - |
dc.description.abstract (Abstract) | Classical Chinese poems of the Tang and Song dynasties are an important part of the study of Chinese literature. To thoroughly understand the poems, properly segmenting the verses is an important step for human readers and software agents. Yet, due to limited data availability and the costs of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with a Chinese literature background labeled 32399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies on the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., that the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentation of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems, and achieved promising results at the line level as well. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely correctly under favorable conditions. Switching to deep-learning methods helped us achieve better than 30%. | |
dc.format.extent | 105 bytes | - |
dc.format.mimetype | text/html | - |
dc.relation (Relation) | Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Taipei Medical University; Association for Computational Linguistics and Chinese Language Processing, pp.135-144 | |
dc.title (Title) | Introducing a large corpus of tokenized classical Chinese poems of Tang and Song dynasties | |
dc.type (Type) | conference | |