Title Named Entity Recognition in Difangzhi Using Sequential Labeling Techniques
應用序列標記技術於地方志的實體名詞辨識
Author Huang, Chih Kai (黃致凱)
Contributors Liu, Chao Lin (劉昭麟), advisor
Huang, Chih Kai (黃致凱)
Keywords Text Mining
Named Entity Recognition
Machine Learning
Digital Humanity
Date 2016
Upload time 9-Aug-2016 11:24:27 (UTC+8)
Abstract Difangzhi are local gazetteers that were compiled by government officials in China. Their content is broad and includes biographies, geographical descriptions, and records of office-holding, among other material, and they mention many people, events, and things that have not yet been catalogued. The vocabulary and grammar of Difangzhi differ substantially from modern Chinese, and most of the text has no punctuation, so the data are character sequences with no word, sentence, or paragraph segmentation. Existing natural language processing tools are therefore not suitable for processing them directly. This study builds named entity recognition models for Difangzhi. Using sequence labeling, the models mark person names and place names, and the annotation also includes tags for official titles, entry into officialdom, reign names, and dates, so that more information about people in ancient China can be extracted from the labeled text.
The models are trained with supervised learning. To generate training data, we take person and place names from manually verified personal information that was previously compiled from Difangzhi, combine them with known noun lists, and use them to annotate the previously processed Difangzhi corpus. Even after manual verification these lists still contain errors, so we clean them in a preprocessing step. Annotation also raises an ambiguity problem, and we propose three annotation methods that resolve it. We use conditional random fields as the sequence labeling model, together with noun lists and rules for pre-annotation. In our experiments, the models produced by the three annotation methods predict character labels for a Difangzhi corpus that had not been processed before, and we evaluate the person names and place names extracted from the labeled result. The precision of person name recognition is above 80% in all cases, and the precision of place name recognition is about 86%. The main reason for these results is that the training corpus and the test corpus are very similar in how they narrate and record information. Using the labeled results, we then apply a simple method to link person names with place names and sample the linked pairs for manual verification. The sample shows that the method links person names and place names correctly under certain syntactic patterns. To support deeper analysis, we also attempt to split the text into biographical entries, using known characteristics of Difangzhi and a finite state machine to decide whether a position is the beginning of an entry. This method finds some entry beginnings but still misses many.
In future work, we plan to add more tag types and refine the annotation design to improve recognition. To obtain more precise information about the people in the corpus, we will try to segment paragraphs and sentences more accurately and to analyze the grammar of Difangzhi, so that syntactic patterns can be used to link persons with other named entities, such as place names and official titles, and to compile more complete information about the people in the corpus automatically.
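The character-level labeling described in the abstract can be illustrated with a minimal sketch. It assumes a small name list and a greedy longest-match rule. It is not the thesis's own annotation procedure, because this record does not specify the three disambiguation methods. The function name `bies_tags`, the tag strings (`B-`, `I-`, `E-`, `S-` prefixes with hypothetical `PER` and `LOC` types, and `O` for everything else), and the example sentence are all assumptions made for illustration.

```python
def bies_tags(text, lexicon):
    """Label each character of `text` using greedy longest-match lookup.

    `lexicon` maps a known name to an entity type. The types used below
    ("PER" for person names, "LOC" for place names) are hypothetical.
    """
    max_len = max((len(term) for term in lexicon), default=0)
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so longer names win over shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            i += 1  # no known name starts here; leave the character as "O"
            continue
        kind = lexicon[match]
        if len(match) == 1:
            tags[i] = f"S-{kind}"  # single-character entity
        else:
            tags[i] = f"B-{kind}"  # first character
            for j in range(i + 1, i + len(match) - 1):
                tags[j] = f"I-{kind}"  # interior characters
            tags[i + len(match) - 1] = f"E-{kind}"  # last character
        i += len(match)
    return tags


# Made-up name list and unpunctuated sentence, for illustration only.
lexicon = {"王安石": "PER", "臨川": "LOC"}
sentence = "王安石臨川人"
labeled = list(zip(sentence, bies_tags(sentence, lexicon)))
```

Character and tag pairs of this kind are the usual input for training a sequence labeler such as a conditional random field. A plain greedy match like this one cannot decide between overlapping or same-spelled names, which is the ambiguity problem the abstract mentions.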
Description Master's thesis
National Chengchi University
Department of Computer Science
102753029
Source http://thesis.lib.nccu.edu.tw/record/#G0102753029
Type thesis
URI http://nccur.lib.nccu.edu.tw/handle/140.119/99804
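Table of contents Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Methods
1.3 Main Contributions
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Related Work on Digital Humanities
2.2 Related Work on Named Entity Recognition
Chapter 3 Difangzhi Texts
3.1 Page Images of Difangzhi
3.2 Digitization of Difangzhi
3.3 Volume of Difangzhi Historical Materials and Problem Definition
Chapter 4 System Architecture
4.1 System Workflow
4.2 Tool for Training the Named Entity Recognition Model
Chapter 5 Generating the Annotated Corpus
5.1 IOB Format Combined with BIES Tag Classes
5.2 Data Preprocessing
5.2.1 Removing Unused Markup from Difangzhi
5.2.2 Splitting the Difangzhi Corpus by Volume
5.2.3 Supplementing Variant Characters
5.2.4 Comparison and Analysis of the Original Noun Lists
5.2.5 Annotation Rules
5.3 Disambiguating Annotation Methods
5.3.1 Sequential Annotation Method
5.3.2 Enumerative Annotation Method
5.3.3 Enumerative Annotation Method with Dynasty Information
Chapter 6 Building the Named Entity Recognition Model
6.1 Conditional Random Fields
6.2 Feature Extraction
Chapter 7 Experimental Results and Analysis
7.1 Evaluating the Named Entity Recognition Models
7.1.1 Person Name Extraction Experiment
7.1.2 Place Name Extraction Experiment
7.2 Experiments Applying the Recognition Results
7.2.1 Linking Person Names and Place Names
7.2.2 Segmenting Biographical Entries
Chapter 8 Conclusion and Future Work
8.1 Conclusion
8.2 Future Work
References
Appendix I Discussion from the Thesis Oral Defense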
File size 2320683 bytes
MIME type application/pdf