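Text Segmentation and Name Entity Recognition for Memorials from the Qing Dynasty with Transformer-based Multitask Learning (基於Transformer之多任務學習用於清代奏摺斷句斷詞命名實體識別)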

薛卉吟; Xue, Hui-Yin

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/138003

DC Field	Value	Language
dc.contributor.advisor	蔡瑞煌; 黃瀚萱	zh_TW
dc.contributor.advisor	Tsaih, Rua-Huan; Huang, Hen-Hsen	en_US
dc.contributor.author	薛卉吟	zh_TW
dc.contributor.author	Xue, Hui-Yin	en_US
dc.creator	薛卉吟	zh_TW
dc.creator	Xue, Hui-Yin	en_US
dc.date	2021	en_US
dc.date.accessioned	2021-12-01T06:30:04Z	-
dc.date.available	2021-12-01T06:30:04Z	-
dc.date.issued	2021-12-01T06:30:04Z	-
dc.identifier	G0108356036	en_US
dc.identifier.uri	http://nccur.lib.nccu.edu.tw/handle/140.119/138003	-
dc.description	碩士	zh_TW
dc.description	國立政治大學	zh_TW
dc.description	資訊管理學系	zh_TW
dc.description	108356036	zh_TW
dc.description.abstract	奏摺，是研究清代政策實施和法制建設的珍貴的史料。雖然存於國立故宮博物院的清代宮中檔及軍機處的奏摺已完成數化，但應用仍然不普及，原因之一是辨識古典漢語的斷句、斷詞和詞義需花費歷史學家大量的時間。對於古典漢語，很少有有用的自然語言處理（NLP）工具，並且先進的人工智能（AI）模型學習不同朝代的訓練數據後，其性能也不盡相同。此外，沒有合適的NLP工具來分析清代的奏摺。為了解決有關於分析清代奏摺的挑戰，本研究探索一種基於Transformer之單任務學習（STL）及多任務學習（MTL）之模型，該模型可同時應付以下三個任務：斷句、斷詞、詞性（POS）標記和命名實體識別（NER）。為了完成此任務，本研究建議的標記方案包括三個部分：（1）用於斷句的BOE格式標籤；（2）用於斷詞的BIES格式標籤；以及（3）用於POS和NER的聯合標籤。為了評估該提案，本研究著重於雍正皇帝時期之奏摺，並收集並建立由中文專業人士參照新標籤標記方案所標註的清朝宮中檔奏摺數據集。研究結果顯示，斷句及斷詞任務中，多任務學習效能顯著優於單任務學習，兩個學習方法在詞性標記和命名實體識別則無顯著差異。模型的斷句結果可以達到輔助初學者們閱讀奏摺，斷詞以及詞性的標注結果則可以協助學者辨認詞義，減少對詞義誤讀的可能。	zh_TW
dc.description.abstract	Memorials are important sources for research on policy implementation and the formation of legal institutions in the Qing dynasty. Although the Qing palace memorials and the Grand Council memorials have been digitized through image scanning, they are still not widely used in academia. One reason is that historians must spend a great deal of time determining sentence boundaries, word boundaries, and word meanings in classical Chinese. The use of natural language processing (NLP) tools for analyzing classical Chinese remains an emerging topic in the digital humanities community. Few NLP tools exist for classical Chinese, and the performance of artificial intelligence (AI) models varies with the dynasty of the texts they were trained on. To address the challenges of analyzing Qing dynasty memorials, this study proposes a classical Chinese analysis model with transformer-based single-task learning (STL) and multitask learning (MTL) that simultaneously handles three tasks: sentence segmentation, word segmentation, and the joint task of part-of-speech (POS) tagging and named entity recognition (NER). Accordingly, the labels have three parts: (1) BOE format tags for sentence segmentation, (2) BIES format tags for word segmentation, and (3) joint tags for POS and NER. To evaluate the proposal, this study focuses on memorials from the reign of the Yong-zheng (雍正) emperor and collects a dataset of Qing palace memorials annotated by Chinese-language professionals according to the new tagging scheme. The results show that MTL performs significantly better than STL on both the sentence segmentation task and the word segmentation task, while on the POS+NER task there is no significant difference between the two methods. The model's predictions can help scholars read memorials more easily and reduce the likelihood of misinterpreting word meanings.	en_US
dc.description.tableofcontents	1 INTRODUCTION; 2 PREVIOUS WORKS; 2.1 Qing Palace Memorials of National Palace Museum; 2.2 Chinese Text Classification Tasks; 2.3 Bidirectional Encoder Representations from Transformers; 2.4 RNN-based Multi-Task Learning; 2.5 Bidirectional Gate Recurrent Unit; 3 EXPERIMENT DESIGN; 3.1 Models; 3.2 Input X; 3.3 Output Tags; 3.3.1 Sentence Segmentation Tags; 3.3.2 Word Segmentation Tags; 3.3.3 Joint Tags of POS and NER; 3.3.4 Example; 3.4 Dataset; 3.4.1 Data Collection for Qing’s Dataset; 3.4.2 Data Labeling for the Qing’s Dataset; 3.4.3 Statistical Description of the Qing’s Dataset; 3.5 Experiment Environment; 3.7 Evaluation; 4 EXPERIMENTS; 4.1 Preprocessing; 4.2 Training; 4.3 Evaluation; 4.4 Comparisons; 4.4.1 Residual Connection; 4.4.2 Compare with Other Models; 4.4.3 Compare with Other Chinese NLP Tools; 4.4.4 Different Tagging Scheme of POS+NER; 4.4.5 Different Granularity of Word Segmentation; 4.5 Discussion; 5 CONCLUSION; REFERENCES; APPENDIX; Chinese Version of Interview and Feedback	zh_TW
dc.format.extent	1466792 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri	http://thesis.lib.nccu.edu.tw/record/#G0108356036	en_US
dc.subject	清代奏摺	zh_TW
dc.subject	斷詞斷句	zh_TW
dc.subject	命名實體識別	zh_TW
dc.subject	多任務學習	zh_TW
dc.subject	自然語言處理	zh_TW
dc.subject	機器學習	zh_TW
dc.subject	古文	zh_TW
dc.subject	Memorial	en_US
dc.subject	Qing Dynasty	en_US
dc.subject	Transformer	en_US
dc.subject	BERT	en_US
dc.subject	Sentence segmentation	en_US
dc.subject	Word segmentation	en_US
dc.subject	Named entity recognition	en_US
dc.subject	Multitask learning	en_US
dc.subject	Classical Chinese	en_US
dc.subject	NLP	en_US
dc.title	基於Transformer之多任務學習用於清代奏摺斷句斷詞命名實體識別	zh_TW
dc.title	Text Segmentation and Name Entity Recognition for Memorials from the Qing Dynasty with Transformer-based Multitask Learning	en_US
dc.type	thesis	en_US
dc.identifier.doi	10.6814/NCCU202101726	en_US
item.openairetype	thesis	-
item.fulltext	With Fulltext	-
item.cerifentitytype	Publications	-
item.grantfulltext	open	-
item.openairecristype	http://purl.org/coar/resource_type/c_46ec	-
Appears in Collections:	Theses (學位論文)
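
The abstract names two character-level tagging formats: BOE tags for sentence segmentation and BIES tags for word segmentation. The sketch below illustrates how such tags can be derived from already segmented text. It is not the thesis's code: the function names `bies_tags` and `boe_tags` are hypothetical, BIES is read in its usual sense (Begin, Inside, End, Single), and the reading of BOE as Begin, Other, End is an assumption, since this record does not define it.

```python
from typing import List


def bies_tags(words: List[str]) -> List[str]:
    """Return one BIES tag per character for a list of segmented words.

    B = first character of a multi-character word, I = inside,
    E = last character, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def boe_tags(sentences: List[str]) -> List[str]:
    """Return one tag per character for a list of segmented sentences.

    ASSUMPTION: B = first character of a sentence, E = last character,
    O = any other character. A one-character sentence is tagged "B" here,
    which is an arbitrary choice made only for this illustration.
    """
    tags: List[str] = []
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) == 1:
            tags.append("B")
        else:
            tags.extend(["B"] + ["O"] * (len(sentence) - 2) + ["E"])
    return tags


if __name__ == "__main__":
    # Illustrative segmentation only, not taken from the thesis dataset.
    print(bies_tags(["雍正", "皇帝", "之", "奏摺"]))
    print(boe_tags(["abc", "de"]))
```

Because both tag sequences have one entry per input character, a multitask model of the kind the abstract describes can predict them from the same character sequence, one sequence per task.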

Files in This Item:

File	Description	Size	Format
603601.pdf		1.43 MB	Adobe PDF
