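Development and Application of An Ancient Chinese Sentence Segmentation System Based on Active Learning (基於主動式學習之古漢語斷句系統發展與應用研究)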

Title	Development and Application of An Ancient Chinese Sentence Segmentation System Based on Active Learning (基於主動式學習之古漢語斷句系統發展與應用研究)
Author	Hsu, Chih-Fan (徐志帆)
Contributor	Chen, Chih-Ming (陳志銘); Hsu, Chih-Fan (徐志帆)
Keywords	digital humanities; active learning; machine learning; automatic ancient Chinese sentence segmentation; human-computer interaction
Date	2019
Upload time	7-Aug-2019 16:26:24 (UTC+8)
Abstract	This study develops an “Ancient Chinese Sentence Segmentation System Based on Active Learning” to support digital humanities research. The system combines active learning with machine learning algorithms so that, through human-computer cooperation, less training corpus is needed to build an automatic ancient Chinese sentence segmentation model, and humanities scholars can segment previously uninterpreted texts more efficiently. To identify the most suitable algorithm and feature template for the system, the first experiment compared segmentation models built with different algorithms and feature templates under two text selection methods: sequential selection and active learning. The results show that conditional random fields combined with a three-character (三字詞) feature template learn effectively under active learning and are suitable for developing the “active learning sentence segmentation mode”. In the second experiment, scholars with humanities expertise were invited to use the system to segment ancient Chinese texts. The segmentation models built from each scholar's annotation data were compared and analyzed, supplemented by semi-structured interviews to understand in depth the scholars' experience with the system and their suggestions for it. The results show that the system does learn effectively from the scholars' segmentation annotations and that the model's predictive ability keeps improving through human-computer cooperation. The analysis also found significant negative correlations between the model's segmentation prediction ability and both the scholars' annotation type ratio and the adjacent character type ratio. Finally, the interviews indicate that the scholars evaluated the system's operation flow and interface positively, and most interviewees considered the segmentation prediction function an effective aid for ancient Chinese sentence segmentation. Future work could add a named entity model or feature templates based on other ancient Chinese rules to further improve segmentation prediction, and could apply the system in humanities education as a digital humanities education platform for training ancient Chinese sentence segmentation.
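The human-in-the-loop workflow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the thesis's implementation: a simple count-based boundary scorer stands in for the conditional random field, the three-character window is an assumed reading of the feature template, passages are selected by mean-entropy uncertainty sampling (one common active learning strategy), and every name (`to_labels`, `CountModel`, `active_learning`) and the sample sentences are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

MARKS = "，。"  # punctuation treated as sentence breaks (assumption)

def to_labels(punctuated):
    """Strip punctuation; label 1 means a break follows that character."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in MARKS:
            if labels:
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels

def features(text, i):
    """Assumed three-character window: previous, current, next character."""
    prev_ch = text[i - 1] if i > 0 else "<s>"
    next_ch = text[i + 1] if i + 1 < len(text) else "</s>"
    return [("prev+cur", prev_ch, text[i]), ("cur", text[i]),
            ("cur+next", text[i], next_ch)]

class CountModel:
    """Stand-in for the CRF: averages smoothed per-feature break rates."""
    def __init__(self):
        # [no-break count, break count] with add-one smoothing
        self.counts = defaultdict(lambda: [1, 1])

    def fit(self, labeled):
        self.counts.clear()
        for text, labels in labeled:
            for i, y in enumerate(labels):
                for f in features(text, i):
                    self.counts[f][y] += 1

    def prob_break(self, text, i):
        rates = []
        for f in features(text, i):
            no, yes = self.counts.get(f, (1, 1))
            rates.append(yes / (no + yes))
        return sum(rates) / len(rates)

def uncertainty(model, text):
    """Mean binary entropy of the break probability over a passage."""
    total = 0.0
    for i in range(len(text)):
        p = model.prob_break(text, i)
        total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total / len(text)

def active_learning(pool, oracle, rounds):
    """Each round: pick the most uncertain passage, ask the annotator, retrain."""
    pool = list(pool)
    labeled = [oracle(pool.pop(0))]  # seed with the first passage
    model = CountModel()
    model.fit(labeled)
    for _ in range(min(rounds, len(pool))):
        pick = max(pool, key=lambda text: uncertainty(model, text))
        pool.remove(pick)
        labeled.append(oracle(pick))  # human annotation step
        model.fit(labeled)
    return model, labeled

# Demo: gold punctuation plays the role of the human annotator.
gold = ["學而時習之，不亦說乎。", "有朋自遠方來，不亦樂乎。", "人不知而不慍，不亦君子乎。"]
answers = dict(to_labels(g) for g in gold)
model, labeled = active_learning(list(answers), lambda t: (t, answers[t]), rounds=2)
print(round(model.prob_break("不亦樂乎", 3), 3))  # break probability after the final character
```

In a real system the `oracle` callback would be the scholar's annotation interface, and the stand-in scorer would be replaced by a trained sequence labeler; the selection loop itself would stay the same.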
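Description	Master's; National Chengchi University (國立政治大學); Graduate Institute of Library, Information and Archival Studies (圖書資訊與檔案學研究所); 106155007
Source	http://thesis.lib.nccu.edu.tw/record/#G0106155007
Type	thesis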

dc.contributor.advisor	陳志銘	zh_TW
dc.contributor.advisor	Chen, Chih-Ming	en_US
dc.contributor.author (Author)	徐志帆	zh_TW
dc.contributor.author (Author)	Hsu, Chih-Fan	en_US
dc.creator (Author)	徐志帆	zh_TW
dc.creator (Author)	Hsu, Chih-Fan	en_US
dc.date (Date)	2019	en_US
dc.date.accessioned	7-Aug-2019 16:26:24 (UTC+8)	-
dc.date.available	7-Aug-2019 16:26:24 (UTC+8)	-
dc.date.issued (Upload time)	7-Aug-2019 16:26:24 (UTC+8)	-
dc.identifier (Other identifier)	G0106155007	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/124825	-
dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Graduate Institute of Library, Information and Archival Studies (圖書資訊與檔案學研究所)	zh_TW
dc.description (Description)	106155007	zh_TW
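dc.description.tableofcontents	Contents; List of Tables; List of Figures. Chapter 1 Introduction: Section 1 Research Background and Motivation; Section 2 Research Purpose; Section 3 Research Questions; Section 4 Research Limitations and Scope; Section 5 Definition of Terms. Chapter 2 Literature Review: Section 1 Traditional Sentence Segmentation Annotation of Ancient Chinese Texts; Section 2 Automatic Sentence Segmentation of Ancient Chinese Texts; Section 3 Active Learning Framework. Chapter 3 Research Design and Methods: Section 1 Modeling Workflow of the Active-Learning-Based Ancient Chinese Text Sentence Segmentation System; Section 2 Data Processing Stage; Section 3 Segmentation Model System Tools; Section 4 System Interface and Functions; Section 5 Model Building and Evaluation for Each Round of Segmentation Data; Section 6 Single-Round Segmentation Annotation of Uninterpreted Data; Section 7 Evaluation of Overall Segmentation Results by Average F-measure and F-measure Change Slope; Section 8 Experimental Design for Evaluating Feature Templates and Algorithms; Section 9 Experimental Design for Evaluating the Active-Learning-Based Ancient Chinese Text Sentence Segmentation System. Chapter 4 Experimental Results and Analysis: Section 1 Comparative Analysis of Average F-measure and Change Slope across Feature Templates and Algorithms; Section 2 Analysis of the Ancient Chinese Segmentation Results of the Active Expert Group; Section 3 Interview Analysis of the Active Expert Group. Chapter 5 Conclusions and Suggestions: Section 1 Conclusions; Section 2 Suggestions for System Improvement; Section 3 Future Research Directions. References.	zh_TW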
dc.format.extent	1866099 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0106155007	en_US
dc.subject (Keywords)	digital humanities (數位人文)	zh_TW
dc.subject (Keywords)	active learning (主動學習)	zh_TW
dc.subject (Keywords)	machine learning (機器學習)	zh_TW
dc.subject (Keywords)	automatic ancient Chinese sentence segmentation (自動化古漢語斷句)	zh_TW
dc.subject (Keywords)	human-computer interaction (人機互動)	zh_TW
dc.subject (Keywords)	digital humanities	en_US
dc.subject (Keywords)	active learning	en_US
dc.subject (Keywords)	machine learning	en_US
dc.subject (Keywords)	automatic ancient Chinese sentence segmentation	en_US
dc.subject (Keywords)	human-computer interaction	en_US
dc.title (Title)	基於主動式學習之古漢語斷句系統發展與應用研究	zh_TW
dc.title (Title)	Development and Application of An Ancient Chinese Sentence Segmentation System Based on Active Learning	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU201900543	en_US
