
Title: 基於Transformer之多任務學習用於清代奏摺斷句斷詞命名實體識別
Text Segmentation and Name Entity Recognition for Memorials from the Qing Dynasty with Transformer-based Multitask Learning
Authors: 薛卉吟
Xue, Hui-Yin
Contributors: 蔡瑞煌

Tsaih, Rua-Huan
Huang, Hen-Hsen

Xue, Hui-Yin
Keywords: Qing dynasty memorials (清代奏摺)
Qing Dynasty
Sentence segmentation
Word segmentation
Named entity recognition
Multitask learning
Classical Chinese
Date: 2021
Issue Date: 2021-12-01 14:30:04 (UTC+8)
Abstract: Memorials to the throne (奏摺) are valuable historical sources for research on policy implementation and the development of legal institutions in the Qing dynasty. Although the Qing palace memorials and Grand Council memorials held by the National Palace Museum have been digitized, they are still not widely used. One reason is that determining sentence boundaries, word boundaries, and word meanings in classical Chinese takes historians a great deal of time. The use of natural language processing (NLP) tools for analyzing classical Chinese remains an emerging topic in the digital humanities community. Few useful NLP tools exist for classical Chinese, the performance of state-of-the-art artificial intelligence (AI) models varies with the dynasty of the training data, and no suitable NLP tool exists for analyzing Qing memorials. To address these challenges, this study explores transformer-based single-task learning (STL) and multitask learning (MTL) models that handle three tasks for classical Chinese simultaneously: sentence segmentation, word segmentation, and a joint task of part-of-speech (POS) tagging and named entity recognition (NER). The proposed labeling scheme has three parts: (1) BOE format tags for sentence segmentation, (2) BIES format tags for word segmentation, and (3) joint tags for POS and NER. To evaluate the proposal, this study focuses on memorials from the reign of the Yongzheng (雍正) Emperor and builds a dataset of Qing palace memorials annotated under the new tagging scheme by Chinese-language professionals. The results show that MTL performs significantly better than STL on both sentence segmentation and word segmentation, while on the joint POS and NER task there is no significant difference between the two methods. The model's sentence segmentation output can help beginners read memorials, and its word segmentation and POS tagging output can help scholars identify word meanings and reduce the chance of misreading them.
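As an illustration of the BIES word-segmentation tags mentioned in the abstract, the following is a minimal Python sketch, not the thesis's own code. It assumes the common reading of BIES (B = first character of a multi-character word, I = inside, E = last character, S = single-character word). The function names `words_to_bies` and `bies_to_words` and the sample segmentation are hypothetical. The BOE sentence tags and the joint POS and NER tags are not defined in this record, so they are left out.

```python
from typing import List, Tuple


def words_to_bies(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a list of segmented words into per-character BIES tags.

    Assumed convention: B = begin, I = inside, E = end, S = single.
    """
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "I"
            tagged.append((ch, tag))
    return tagged


def bies_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs; a word closes on E or S."""
    words: List[str] = []
    current = ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep any trailing, unclosed characters as a final word
        words.append(current)
    return words


# Hypothetical segmentation, chosen only to show all four tags.
example_words = ["軍機處", "奏摺", "已"]
example_tags = words_to_bies(example_words)
print("".join(tag for _, tag in example_tags))  # prints BIEBES
```

In a multitask setup such as the one the abstract describes, each character would carry one label per task; this sketch covers only the word-segmentation label.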
Description: Master's degree (碩士)
Source URI:
Data Type: thesis
Appears in Collections: [Department of Information Management (資訊管理學系)] Theses (學位論文)

Files in This Item:

File        Description  Size    Format
603601.pdf               1432Kb  Adobe PDF

All items in 學術集成 are protected by copyright, with all rights reserved.
