Title: 唐代墓誌銘與中國佛教寺廟志斷句研究
Sentence Segmentation for Tomb Biographies of Tang Dynasty and Chinese Buddhist Temple Gazetteers
Author: 張逸 (Chang, Yi)
Contributors: 劉昭麟 (Liu, Chao-Lin), advisor; 張逸 (Chang, Yi)
Keywords: 深度學習 (Deep learning); 機器學習 (Machine learning); 自然語言處理 (Natural language processing)
Date: 2018
Uploaded: 3-Sep-2018 15:52:15 (UTC+8)
Abstract: Before the 20th century, written Chinese was generally not punctuated; readers had to segment a text into sentences by drawing on personal experience and their feel for the language. Because experience and reading habits differ from person to person, the same passage could be interpreted differently or even misread, which makes sentence segmentation the most fundamental, and a difficult, first step toward understanding a text. Researchers have therefore applied regular expressions, machine learning, deep learning, and other methods to automate the segmentation of classical Chinese, reducing the time that experts in history and literature must spend on the task.
     Although there has been considerable research on automatic segmentation, no system has yet integrated this work to achieve the best segmentation performance. This study therefore designs an experimental procedure that combines techniques from previous studies, evaluates each combination with Precision, Recall, and F1, and identifies the best-performing one, further reducing the time needed for segmentation.
     The experiments use the “Tomb Biographies of Tang Dynasty” and the “Chinese Buddhist Temple Gazetteers” as corpora. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) networks, two models that performed well in earlier segmentation studies, are combined with context features to form the baseline, on which further feature-related and model-related combination experiments are run. The feature-related experiments add various features to the baseline to identify the useful ones; the model-related experiments compare machine learning methods and training procedures to find those that improve the models.
     In our experiments, the most effective features are the context and word-segmentation statistics, and the best model is CRF+LSTM, an integration of CRF and LSTM in which the CRF is further strengthened by a weakness-patching algorithm. The final F1 scores on the Tomb Biographies of Tang Dynasty and the Chinese Buddhist Temple Gazetteers corpora are 0.873 and 0.675, respectively.
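To make the baseline described in the abstract concrete, here is a minimal sketch, not the thesis implementation, of sentence segmentation cast as character-level sequence labeling with context features and a CRF, using the sklearn-crfsuite package cited in [15]. The 'S'/'N' label scheme, the ±2-character window, the toy passages, and the hyperparameters are illustrative assumptions.

import sklearn_crfsuite
from sklearn_crfsuite import metrics

def char_features(chars, i, window=2):
    # Context features: the character at position i and its +/- window neighbors.
    feats = {'char': chars[i]}
    for k in range(1, window + 1):
        feats[f'char-{k}'] = chars[i - k] if i - k >= 0 else '<BOS>'
        feats[f'char+{k}'] = chars[i + k] if i + k < len(chars) else '<EOS>'
    return feats

def to_instance(punctuated):
    # Strip punctuation from a training passage; the character before each
    # mark is labeled 'S' (sentence-final), every other character 'N'.
    chars, labels = [], []
    for ch in punctuated:
        if ch in '，。、；：？！':
            if labels:
                labels[-1] = 'S'
        else:
            chars.append(ch)
            labels.append('N')
    return [char_features(chars, i) for i in range(len(chars))], labels

# Toy passages standing in for the tomb-biography and gazetteer corpora.
train = [to_instance('公諱某，字某，葬于洛陽。'), to_instance('寺在山東，建于唐。')]
X, y = [f for f, _ in train], [l for _, l in train]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# Report the F1 of the sentence-boundary label, the metric used in the study.
y_pred = crf.predict(X)
print(metrics.flat_f1_score(y, y_pred, average='binary', pos_label='S'))

In this formulation, the thesis's feature-related experiments amount to widening or enriching char_features, and Precision, Recall, and F1 are computed over the predicted 'S' labels.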
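Among the word-segmentation statistics the abstract reports as most effective, the table of contents lists pointwise mutual information (Section 4.3.2; cf. Church et al. [11] and 彭維謙 [5]). Below is a hedged sketch of one plausible formulation, PMI over adjacent characters estimated from the raw corpus, where a low score between neighbors hints at a word (and possibly sentence) boundary; turning the score into a CRF feature at the end is an assumption for illustration, not the thesis's exact recipe.

import math
from collections import Counter

def adjacent_pmi(corpus):
    # corpus: iterable of unpunctuated strings. Returns a dict mapping a
    # character bigram (a, b) to PMI(a, b) = log( P(a,b) / (P(a) * P(b)) ).
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {(a, b): math.log((c / n_bi) /
                             ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
            for (a, b), c in bigrams.items()}

pmi = adjacent_pmi(['公諱某字某葬于洛陽', '寺在山東建于唐'])
# As an extra CRF feature, one might bucket the PMI with the next character:
# feats['pmi_next'] = round(pmi.get((chars[i], chars[i + 1]), 0.0))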
References: [1] 王博立、史曉東、蘇勁松, 一種基於循環神經網路的文言文斷句方法, 北京大學學報, Vol. 53, No. 2, 2017.
     [2] 周紹良, 《唐代墓誌彙編》, 上海古籍出版社.
     [3] 孫茂松、肖明 et al., 基於無指導學習策略的無詞表條件下的漢語自動分詞, 計算機學報, Vol. 27, No. 6, 2004.
     [4] 張開旭、夏云慶、宇航, 基於條件隨機場的古漢語自動斷句與標點方法, 清華大學學報, 2009.
     [5] 彭維謙, 自動擷取中文典籍中人名之嘗試──以PMI (Pointwise Mutual Information) 斷詞於《資治通鑑》的應用為例, Master's thesis, 國立台灣大學 資訊工程所 (advisor: 項潔), 2012.
     [6] 黃建年、侯漢清, 農業古籍斷句標點模式研究, 中文信息學報, 2008.
     [7] 黃致凱, 應用序列標記技術於地方志的實體名詞辨識, Master's thesis, 國立政治大學 資訊科學學系 (advisor: 劉昭麟), 2016.
     [8] 黃瀚萱, 以序列標記法解決古漢語斷句問題, Master's thesis, 國立交通大學 資訊工程學系 (advisor: 孫春在), 2008.
     [9] 蘭和群, 文言文斷句與翻譯技巧, 河南師範大學學報(哲學社會科學版), 2005.
     [10] Ethem Alpaydin, Introduction to Machine Learning (2nd ed.), The MIT Press, 489-493, 2010.
     [11] Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle, Using Statistics in Lexical Analysis, in Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, 1991.
     [12] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv:1412.3555, 2014.
     [13] Hen-Hsen Huang, Chuen-Tsai Sun, and Hsin-Hsi Chen, Classical Chinese Sentence Segmentation, CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2010.
     [14] Tin Kam Ho, Random Decision Forests, Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1995.
     [15] Mikhail Korobov, sklearn-crfsuite, https://sklearn-crfsuite.readthedocs.io/, 2015.
     [16] J. Lafferty, A. McCallum, and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the 18th International Conference on Machine Learning, 282-289, 2001.
     [17] R. Rojas, AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting, 3-5, 2009.
     [18] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, Sequence to Sequence Learning with Neural Networks, Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.
     [19] Yushi Yao and Zheng Huang, Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation, arXiv:1602.04874, 2016.
Description: Master's
National Chengchi University
Department of Computer Science
104753032
Source: http://thesis.lib.nccu.edu.tw/record/#G0104753032
Type: thesis
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/119910
Table of Contents: Chapter 1 Introduction 1
     1.1 Background and Motivation 1
     1.2 Problem Description 2
     1.3 Research Objectives 2
     1.4 Main Contributions 2
     1.5 Thesis Organization 4
     Chapter 2 Related Work 5
     2.1 Regular Expressions 5
     2.2 Machine Learning 5
     2.3 Deep Learning 6
     2.4 Ensemble Learning 6
     Chapter 3 Corpora and System Architecture 7
     3.1 Corpus Sources and Preprocessing 7
     3.2 Tomb Biographies of Tang Dynasty 8
     3.3 Chinese Buddhist Temple Gazetteers 10
     3.4 System Architecture 11
     Chapter 4 Dataset Construction 13
     4.1 Character Labeling 13
     4.2 Context Features 13
     4.3 Word Segmentation Statistics Features 14
     4.3.1 t-test difference 14
     4.3.2 Pointwise Mutual Information 15
     4.4 Phonological Features 15
     4.5 Lexicon Tagging Features 17
     4.6 Regular Expression Correction 18
     4.6.1 Lexicon-Based Correction 18
     4.6.2 Correction of Long Official Titles 18
     4.7 Format Conversion 19
     4.7.1 Character Embeddings 19
     4.7.2 Numeric-to-String Conversion 21
     4.7.3 String-to-Numeric Conversion 21
     Chapter 5 Model Construction and Evaluation 23
     5.1 CRF Model 23
     5.2 LSTM Model 24
     5.3 Sequence-to-Sequence LSTM 27
     5.4 CRF Ensemble Learning Models 30
     5.4.1 CRF-Bagging 30
     5.4.2 CRF-Boosting 31
     5.4.3 Probability Output of the CRF Ensembles 34
     5.4.4 Output Adjustment of the CRF Ensembles 34
     5.4.5 CRF Parameter Optimization 35
     5.5 CRF+LSTM Model 35
     5.6 Model Evaluation 36
     Chapter 6 Experiment Design 37
     6.1 Machine Learning Tools 37
     6.2 Data Formats 39
     6.3 Baseline Settings 41
     6.4 Experimental Procedure 41
     6.5 Validation of the Basic Settings 43
     6.5.1 Automatic Punctuation Experiments 43
     6.5.2 Data Format Experiments 45
     6.5.3 Unidirectional and Bidirectional LSTM Experiments 46
     Chapter 7 Analysis of Results on the Tomb Biographies of Tang Dynasty 47
     7.1 Segmentation Model Selection 47
     7.2 Context Window Experiments 48
     7.3 Auxiliary Feature Selection 50
     7.3.1 Word Segmentation Statistics 50
     7.3.2 Lexicon Tagging 51
     7.3.3 Phonology 53
     7.4 Data Requirements of the Models 56
     7.5 CRF Ensemble Learning 56
     7.6 LSTM Structure Adjustments 57
     7.6.1 Effect of Character Embedding Dimensionality 57
     7.6.2 Number of LSTM Layers 58
     7.6.3 Sequence-to-Sequence 59
     7.7 Best Integration of CRF and LSTM 60
     7.8 Lexicon Correction Results for the Tomb Biographies 61
     Chapter 8 Analysis of Results on the Chinese Buddhist Temple Gazetteers 63
     8.1 Segmentation Model Selection 63
     8.2 Context Window Experiments 64
     8.3 Auxiliary Feature Selection 65
     8.3.1 Word Segmentation Statistics 65
     8.3.2 Lexicon Tagging 66
     8.3.3 Phonology 67
     8.4 Data Requirements of the Models 68
     8.5 CRF Ensemble Learning 69
     8.6 LSTM Structure Adjustments 70
     8.6.1 Effect of Character Embedding Dimensionality 70
     8.6.2 Comparison of LSTM Layer Counts 71
     8.6.3 Sequence-to-Sequence 71
     8.7 Best Integration of CRF and LSTM 72
     8.8 Lexicon Correction Results for the Temple Gazetteers 73
     Chapter 9 Conclusions and Future Work 75
     References 76
     Appendix A Discussion from the Thesis Oral Defense 78
DOI: 10.6814/THE.NCCU.CS.022.2018.B02