Title 基於英文維基百科之文字蘊涵
Text Entailment based on English Wikipedia
Author 林柏誠
Lin, Po Cheng
Contributors 劉昭麟
Liu, Chao Lin
林柏誠
Lin, Po Cheng
Keywords 自然語言處理
Natural Language Processing
Date 2014
Upload time 5-Jan-2015 11:22:29 (UTC+8)
Abstract In recent years, research on textual entailment has received growing attention in natural language processing. Since the Recognizing Textual Entailment (RTE) challenge began evaluating systems on English corpora in 2005, more and more researchers have taken up related work, and NII Testbeds and Community for Information access Research (NTCIR) has organized the Recognizing Inference in Text (RITE) task since its ninth round. Besides English, RITE covers Traditional Chinese, Simplified Chinese, and Japanese corpora, and it has drawn growing participation from researchers across Asia.
This study builds on textual entailment techniques to judge, with the help of Wikipedia, whether a given statement is consistent with or contradicts the facts. Based on the linguistic information in the statement, we locate related articles in Wikipedia, search them for sentences that support or oppose the statement, and use those sentences to decide the outcome.
The system is roughly divided into three stages. First, we retrieve articles related to the statement from Wikipedia; second, we extract sentences related to the statement from those articles; finally, we determine whether each related sentence supports or opposes the statement. We use Linearly Weighted Functions (LWFs) to learn the weight of each feature and the thresholds of the inference decisions. With this procedure and a set of effective linguistic features, we aim to infer whether a statement is true or false.
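The three-stage procedure and the LWF-based decision can be illustrated with a small, self-contained Python sketch. Everything below is a hypothetical illustration, not the system built in the thesis: the feature functions word_overlap and negation_mismatch, the hand-picked weights, and the threshold are placeholder assumptions, and stages 1 and 2 (retrieving related Wikipedia articles, for instance through a Lucene index, and extracting related sentences from them) are assumed to have already produced the list of related sentences. In the thesis the feature weights and thresholds are trained with LWFs rather than set by hand.

# Hypothetical sketch of the three-stage validation pipeline described in the
# abstract. Feature functions, weights, and threshold are illustrative only.

NEGATIONS = {"not", "no", "never", "n't"}


def word_overlap(statement, sentence):
    """Fraction of the statement's tokens that also appear in the sentence."""
    s_tokens = set(statement.lower().split())
    t_tokens = set(sentence.lower().split())
    return len(s_tokens & t_tokens) / len(s_tokens) if s_tokens else 0.0


def negation_mismatch(statement, sentence):
    """1.0 if exactly one of the two texts contains a negation word, else 0.0."""
    a = bool(NEGATIONS & set(statement.lower().split()))
    b = bool(NEGATIONS & set(sentence.lower().split()))
    return 1.0 if a != b else 0.0


def lwf_score(features, weights):
    """Linearly weighted function: the weighted sum of the feature values."""
    return sum(weights[name] * value for name, value in features.items())


def validate(statement, related_sentences, weights, threshold):
    """Stage 3: decide whether the statement is true, given related sentences
    already extracted from related Wikipedia articles (stages 1 and 2)."""
    if not related_sentences:
        return False  # no evidence found for the statement
    support = 0
    for sentence in related_sentences:
        features = {
            "word_overlap": word_overlap(statement, sentence),
            "negation_mismatch": negation_mismatch(statement, sentence),
        }
        # A sentence whose weighted score clears the threshold counts as support.
        if lwf_score(features, weights) >= threshold:
            support += 1
    # Call the statement true if a majority of related sentences support it.
    return support >= len(related_sentences) / 2


if __name__ == "__main__":
    # Toy usage with hand-picked weights; a negative weight penalizes
    # negation mismatches between the statement and a candidate sentence.
    weights = {"word_overlap": 1.0, "negation_mismatch": -0.5}
    sentences = [
        "Wikipedia is a free online encyclopedia.",
        "Wikipedia is not a paper encyclopedia.",
    ]
    print(validate("Wikipedia is a free encyclopedia.", sentences, weights, 0.5))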
References
[1] Adams, “Textual Entailment Through Extended Lexical Overlap,” Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, pp. 128-133, 2006.
[2] BLEU, http://en.wikipedia.org/wiki/BLEU
[3] A. Budanitsky and G. Hirst, “Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures,” Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, Pennsylvania, USA, 2001.
[4] S. Cohen and N. Or, “A general algorithm for subtree similarity-search,” Proceedings of the IEEE 30th International Conference on Data Engineering (ICDE), pp. 928-939, 2014.
[5] Grid search, http://scikit-learn.org/stable/modules/grid_search.html
[6] S. Hattori and S. Sato, “Team SKL’s Strategy and Experience in RITE2,” Proceedings of the 10th NTCIR Conference, pp. 435-442, 2013.
[7] A. Hickl, J. Bensley, J. Williams, K. Roberts, B. Rink, and Y. Shi, “Recognizing Textual Entailment with LCC’s GROUNDHOG System,” Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, pp. 80-85, 2006.
[8] Heuristic function, http://en.wikipedia.org/wiki/Heuristic_function
[9] W.-J. Huang and C.-L. Liu, “NCCU-MIG at NTCIR-10: Using Lexical, Syntactic, and Semantic Features for the RITE Tasks,” Proceedings of the 10th NTCIR Conference, pp. 430-434, 2013.
[10] G. Li, X. Liu, J. Feng, and L. Zhou, “Efficient Similarity Search for Tree-Structured Data,” Proceedings of the 20th Scientific and Statistical Database Management Conference, pp. 131-149, 2008.
[11] Linearly Weighted Functions, http://en.wikipedia.org/wiki/Weight_function
[12] Longest Common Strings, http://en.wikipedia.org/wiki/Longest_common_substring_problem
[13] Lucene, http://lucene.apache.org/core/
[14] Named Entity Recognition, http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
[15] NTCIR RITE-VAL, http://research.nii.ac.jp/ntcir/index-en.html
[16] RTE, http://research.microsoft.com/en-us/groups/nlp/rte.aspx
[17] S. R. Safavian and D. Landgrebe, “A Survey of Decision Tree Classifier Methodology,” IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, May 1991.
[18] Stanford CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml
[19] Stanford Named Entity Recognizer, http://www-nlp.stanford.edu/software/CRF-NER.shtml
[20] Stanford Parser, http://nlp.stanford.edu/software/lex-parser.shtml
[21] Stanford Typed Dependencies, http://nlp.stanford.edu/software/stanford-dependencies.shtml
[22] SVM, http://en.wikipedia.org/wiki/Support_vector_machine
[23] Textual Entailment, http://en.wikipedia.org/wiki/Textual_entailment
[24] Total Commander, http://www.ghisler.com/
[25] Wikipedia, http://en.wikipedia.org/wiki/Main_Page
[26] WordNet, http://wordnet.princeton.edu/
[27] S.-H. Wu, S.-S. Yang, L.-P. Chen, H.-S. Chiu, and R.-D. Yang, “CYUT Chinese Textual Entailment Recognition System for NTCIR-10 RITE-2,” Proceedings of the 10th NTCIR Conference, pp. 443-448, 2013.
[28] S.-H. Wu, W.-C. Huang, L.-P. Chen, and T. Ku, “Binary-class and Multi-class Chinese Textural Entailment System Description in NTCIR-9 RITE,” Proceedings of the 9th NTCIR Conference, pp. 422-426, 2011.
[29] Y. Y. Zhang, J. Xu, C.-L. Liu, X.-L. Wang, R.-F. Xu, Q.-C. Chen, X. Wang, Y.-S. Hou, and B. Tang, “ICRC_HITSZ at RITE: Leveraging Multiple Classifiers Voting for Textual Entailment Recognition,” Proceedings of the 9th NTCIR Conference, pp. 325-329, 2011.
Description Master's thesis
National Chengchi University
Department of Computer Science
101753028
103 (academic year)
Source http://thesis.lib.nccu.edu.tw/record/#G1017530281
Type thesis
dc.contributor.advisor 劉昭麟 zh_TW
dc.contributor.advisor Liu, Chao Lin en_US
dc.contributor.author (Authors) 林柏誠 zh_TW
dc.contributor.author (Authors) Lin, Po Cheng en_US
dc.creator (Author) 林柏誠 zh_TW
dc.creator (Author) Lin, Po Cheng en_US
dc.date (Date) 2014 en_US
dc.date.accessioned 5-Jan-2015 11:22:29 (UTC+8)
dc.date.available 5-Jan-2015 11:22:29 (UTC+8)
dc.date.issued (Upload time) 5-Jan-2015 11:22:29 (UTC+8)
dc.identifier (Other Identifiers) G1017530281 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/72556
dc.description (Description) 碩士 zh_TW
dc.description (Description) 國立政治大學 zh_TW
dc.description (Description) 資訊科學學系 zh_TW
dc.description (Description) 101753028 zh_TW
dc.description (Description) 103 zh_TW
dc.description.tableofcontents Chapter 1 Introduction
1.1 Background and Motivation
1.2 Overview of the Method
1.3 Main Contributions
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Related Work on Textual Entailment
2.2 Related Work on the RTE and RITE Evaluations
Chapter 3 Corpora and Lexical Resources
3.1 Datasets
3.2 English Wikipedia
3.3 WordNet
Chapter 4 Methodology
4.1 Retrieving Related Articles and Related Sentences
4.1.1 Retrieving Related Articles
4.1.2 Retrieving Related Sentences
4.2 Relevance Computation
4.2.1 Related-Sentence Weights
4.2.2 Article Weights
4.2.3 Combined Related-Sentence Weights
4.3 Inference and Validation System
4.3.1 Linguistic Features
4.3.2 LWF Formulas and Parameter Training
Chapter 5 System Performance Evaluation
5.1 Linearly Weighted Functions Parameters and Thresholds
5.2 Experimental Results and Discussion
Chapter 6 A Small-Scale Experiment Using Information Retrieval Methods
6.1 Overview of the Method
6.2 Corpus
6.3 Experimental Results
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References
Appendix: Examples of Related Articles and Related Sentences
dc.format.extent 1074690 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G1017530281 en_US
dc.subject (Keywords) 自然語言處理 zh_TW
dc.subject (Keywords) Natural Language Processing en_US
dc.title (Title) 基於英文維基百科之文字蘊涵 zh_TW
dc.title (Title) Text Entailment based on English Wikipedia en_US
dc.type (Type) thesis en