Title 基於主動式學習之古漢語斷句系統發展與應用研究
Development and Application of An Ancient Chinese Sentence Segmentation System Based on Active Learning
Author 徐志帆
Hsu, Chih-Fan
Contributors 陳志銘
Chen, Chih-Ming
徐志帆
Hsu, Chih-Fan
Keywords digital humanities
active learning
machine learning
automatic ancient Chinese sentence segmentation
human-computer interaction
Date 2019
Uploaded 7 August 2019 16:26:24 (UTC+8)
Abstract
This study develops an “Ancient Chinese Sentence Segmentation System Based on Active Learning” to support digital humanities research. The system combines active learning with machine learning algorithms so that, through human-computer cooperation, less training data is needed to build an automatic ancient Chinese sentence segmentation model, and humanities scholars can segment previously uninterpreted documents more efficiently. To identify the most suitable algorithm and feature template for the system, the first experiment compared segmentation models built with different algorithms and feature templates under two text-selection methods: sequential selection and active learning. The results show that conditional random fields with a three-character feature template learn effectively under active learning and are therefore suitable for developing the “active learning sentence segmentation mode”. In the second experiment, humanities scholars were invited to use the system to segment ancient Chinese texts. The segmentation models built from each scholar's annotations were compared and analyzed, and semi-structured interviews were conducted to understand in depth the scholars' experience with, and suggestions for, the system's segmentation support. The results show that the system learns effectively from the scholars' segmentation annotations and that the model's predictive ability improves steadily through human-computer cooperation.
The analysis also found significant negative correlations between the model's segmentation prediction ability and both the annotation type ratio and the adjacent-character type ratio of the scholars' annotations. The interviews indicate that the scholars evaluated the system's operating procedure and interface positively, and most respondents considered its segmentation prediction function an effective aid for ancient Chinese sentence segmentation. Future work could add a named entity model or feature templates based on other ancient Chinese rules to further improve prediction, and the system could be applied in humanities education as a digital humanities platform for training ancient Chinese sentence segmentation.
Description Master's thesis
National Chengchi University (國立政治大學)
Graduate Institute of Library, Information and Archival Studies
106155007
Source http://thesis.lib.nccu.edu.tw/record/#G0106155007
Type thesis
Advisor 陳志銘 (Chen, Chih-Ming)
Identifier G0106155007
URI http://nccur.lib.nccu.edu.tw/handle/140.119/124825
DOI 10.6814/NCCU201900543
Format 1866099 bytes, application/pdf
Table of contents
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  Section 1 Research Background and Motivation
  Section 2 Research Purpose
  Section 3 Research Questions
  Section 4 Research Limitations and Scope
  Section 5 Definition of Terms
Chapter 2 Literature Review
  Section 1 Traditional Sentence Segmentation Annotation of Ancient Chinese Texts
  Section 2 Automatic Sentence Segmentation of Ancient Chinese Texts
  Section 3 Active Learning Framework
Chapter 3 Research Design and Methods
  Section 1 Modeling Workflow of the Active-Learning-Based Ancient Chinese Text Sentence Segmentation System
  Section 2 Data Processing Stage
  Section 3 Sentence Segmentation Model System Tools
  Section 4 System Interface and Functions
  Section 5 Model Building and Evaluation for Each Round of Segmentation Data
  Section 6 Single-Round Segmentation Annotation of Uninterpreted Data
  Section 7 Evaluation of Overall Segmentation Results by Mean F-measure and Slope of F-measure Change
  Section 8 Experimental Design for Evaluating Feature Templates and Algorithms
  Section 9 Experimental Design for Evaluating the Active-Learning-Based Ancient Chinese Text Sentence Segmentation System
Chapter 4 Experimental Results and Analysis
  Section 1 Comparative Analysis of Mean F-measure and Slope of Change across Feature Templates and Algorithms
  Section 2 Analysis of the Active Expert Group's Ancient Chinese Segmentation Results
  Section 3 Analysis of Interviews with the Active Expert Group
Chapter 5 Conclusions and Suggestions
  Section 1 Conclusions
  Section 2 Suggestions for System Improvement
  Section 3 Future Research Directions
References
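Note This record names an active learning selection step and two evaluation quantities (the mean F-measure and the slope of F-measure change across rounds) without giving their definitions. The following is a minimal sketch of how such quantities might be computed, not the thesis's implementation. It assumes each character carries a label of 1 when a break follows it and 0 otherwise, that the slope is an ordinary least-squares fit against the round index, and that text selection uses a least-confidence criterion. The function names `f_measure`, `slope`, and `least_confident` are hypothetical.

```python
from typing import List, Sequence, Set


def boundary_positions(labels: Sequence[int]) -> Set[int]:
    """Indices of characters that are followed by a break (label 1)."""
    return {i for i, y in enumerate(labels) if y == 1}


def f_measure(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall over break positions."""
    g, p = boundary_positions(gold), boundary_positions(pred)
    hits = len(g & p)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(p), hits / len(g)
    return 2 * precision * recall / (precision + recall)


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of per-round scores against round index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def least_confident(break_probs: List[List[float]]) -> int:
    """Index of the passage whose break probabilities sit closest to 0.5.

    Assumption: each inner list is non-empty and holds one model
    probability per character that a break follows that character.
    """
    def uncertainty(probs: List[float]) -> float:
        # 1 when a probability is exactly 0.5, 0 when it is 0 or 1.
        return sum(1 - abs(2 * q - 1) for q in probs) / len(probs)

    return max(range(len(break_probs)), key=lambda i: uncertainty(break_probs[i]))
```

Under these assumptions, each round would call `least_confident` on the unlabeled passages to choose the next one for a scholar to annotate, retrain the model, and record `f_measure` on held-out text; the mean of the recorded scores and `slope` of the same list then summarize how well and how quickly the model learns.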