dc.contributor.advisor | 劉吉軒 | zh_TW |
dc.contributor.advisor | Jyi-Shane Liu | en_US |
dc.contributor.author (作者) | 翁嘉緯 | zh_TW |
dc.contributor.author (作者) | Chia-Wei Weng | en_US |
dc.creator (作者) | 翁嘉緯 | zh_TW |
dc.creator (作者) | Chia-Wei Weng | en_US |
dc.date (日期) | 2003 | en_US |
dc.date.accessioned | 17-Sep-2009 13:53:20 (UTC+8) | - |
dc.date.available | 17-Sep-2009 13:53:20 (UTC+8) | - |
dc.date.issued (上傳時間) | 17-Sep-2009 13:53:20 (UTC+8) | - |
dc.identifier (其他 識別碼) | G0090753018 | en_US |
dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/32628 | - |
dc.description (描述) | Master's | zh_TW |
dc.description (描述) | National Chengchi University (國立政治大學) | zh_TW |
dc.description (描述) | Department of Computer Science (資訊科學學系) | zh_TW |
dc.description (描述) | 90753018 | zh_TW |
dc.description (描述) | 92 | zh_TW |
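dc.description.abstract (摘要) | With the rapid growth of the Internet, information extraction has become a very important technology. Its goal is to organize, from unstructured text, structured information relevant to a specific topic. The problems involved include analyzing document content and filtering and extracting the relevant text and its corresponding meaning. So far, most information extraction systems have focused on English documents, while research on information extraction for Chinese documents is only now getting under way in earnest. Since at least one-fifth of the world's population speaks Chinese, active research on Chinese information extraction is very important. Chinese is written very differently from English. In English, words are separated by explicit spaces, so a computer can easily separate the words in an input string. In Chinese, there are no explicit boundaries between words. The usual approach is to match the characters of an input string against the entries of a dictionary and use the matches as the basis for word segmentation. However, because characters combine into words in highly variable ways, segmentation errors are still likely. In this thesis we therefore propose to handle Chinese information extraction without word segmentation or part-of-speech analysis, using pattern matching together with finite state automata. In our experiments we used the Presidential Office personnel appointment and dismissal gazette (總統府人事任免公報) as test data; precision reached 98% and recall reached 97%. We also applied the method to other data domains, making initial progress toward a cross-domain Chinese information extraction system and demonstrating the feasibility of the method for Chinese information extraction. | zh_TW |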
dc.description.abstract (摘要) | With the rapid growth of the World Wide Web, information extraction has become a major technical area. The goal of information extraction is to transform unstructured text into structured data on a specific topic. It involves analyzing, filtering, and extracting the relevant parts of a text and their corresponding meaning. Most information extraction research focuses on English text, while Chinese information extraction has received far less attention. Given that about one-fifth of the world's population speaks Chinese, Chinese information extraction technology will become increasingly important. Chinese differs from English in many respects. In English, words are separated by spaces, so a computer can easily identify each word in an input string. In Chinese, there are no spaces between characters to group them into meaningful words. A common solution is to match the characters of the input string against the words in a dictionary to find word boundaries. However, characters combine into words flexibly and ambiguously, so word segmentation errors may still occur. In this thesis, we propose an approach to Chinese information extraction based on pattern matching and finite state automata that does not rely on word segmentation or part-of-speech tagging. The approach was evaluated with government personnel directives from official gazettes as test data and achieved 98% precision and 97% recall. The approach was also applied to other data domains, and the results show initial progress toward a multi-domain Chinese information extraction system. | en_US |
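The abstract above describes extraction driven by pattern matching and finite state automata, with no word segmentation step. The following is a minimal sketch of that general idea only, not the thesis's actual system: the pattern `APPOINT`, its keywords, the field names `name` and `title`, and the function `extract` are all hypothetical choices made for illustration. The automaton scans the text character by character, and each state waits for one trigger keyword; the text read while waiting is either captured into a field or skipped.

```python
from typing import Dict, List, Optional, Tuple

# One step per state: (keyword that triggers the transition out of the state,
# field that receives the text read while waiting, or None to skip that text).
Pattern = List[Tuple[str, Optional[str]]]

# Hypothetical pattern for sentences such as "任命王小明為科長。"
APPOINT: Pattern = [
    ("任命", None),    # state 0: skip text until the trigger keyword
    ("為", "name"),    # state 1: capture the person name up to "為"
    ("。", "title"),   # state 2: capture the job title up to the full stop
]

def extract(text: str, pattern: Pattern) -> Optional[Dict[str, str]]:
    """Run a linear finite state automaton over raw characters.

    Returns the captured fields if the accepting state is reached,
    otherwise None. No dictionary lookup or word segmentation is used.
    """
    state = 0   # index of the current state in the pattern
    start = 0   # where the text belonging to the current state begins
    i = 0       # current character position
    record: Dict[str, str] = {}
    while i < len(text) and state < len(pattern):
        keyword, field = pattern[state]
        if text.startswith(keyword, i):
            if field is not None:
                record[field] = text[start:i]
            i += len(keyword)
            start = i
            state += 1
        else:
            i += 1
    return record if state == len(pattern) else None

result = extract("總統令：任命王小明為科長。", APPOINT)
print(result)
```

This sketch always takes the first occurrence of a keyword, so it would split a captured value that itself contains a transition keyword. The table of contents lists that situation as a separate topic (section 3.5.4), and the sketch makes no attempt to reproduce how the thesis resolves it.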
dc.description.tableofcontents | Chapter 1 Introduction
  1.1 Background
  1.2 Research Motivation
  1.3 Research Objectives
  1.4 Research Method
  1.5 Thesis Structure and Contributions
Chapter 2 Literature Review
  2.1 Related Work on Information Extraction
    2.1.1 Information Extraction and Text Understanding
    2.1.2 Information Extraction and Information Retrieval
    2.1.3 Question Answering Systems
    2.1.4 The Five Information Extraction Subtasks in MUC-7
    2.1.5 Two Approaches to Building Information Extraction Systems
    2.1.6 Three Techniques for Developing Information Extraction Capabilities
  2.2 Information Extraction Techniques for Semi-Structured Documents
    2.2.1 WIEN
    2.2.2 SoftMealy
    2.2.3 STALKER
    2.2.4 IEPAD
  2.3 Information Extraction Techniques for Plain-Text Documents
    2.3.1 AutoSlog
    2.3.2 FASTUS
    2.3.3 The Presidential Office Personnel Gazette Information Extraction System
  2.4 Summary
Chapter 3 System Model Development
  3.1 Formation of the Concept of Extracting Chinese Information by Pattern Matching
  3.2 Assigning Attributes to Semantic Elements in Extraction Patterns
  3.3 Multi-Level Chinese Information Extraction
  3.4 System Execution Tool: Finite State Automata
  3.5 System Architecture and Execution
    3.5.1 Template Construction
    3.5.2 System Execution
    3.5.3 Worked Example
    3.5.4 Handling Extracted Content That Contains State-Transition Keywords
    3.5.5 Giving Priority to Finite State Automata with More States
  3.6 Summary
Chapter 4 Experimental Analysis, Discussion, and System Applications
  4.1 Test Data and Implementation
  4.2 Evaluation Method
  4.3 Discussion of Experimental Results
    4.3.1 Experimental Results
    4.3.2 Analysis and Discussion of Experimental Results
  4.4 System Applications
  4.5 Summary
Chapter 5 Conclusions and Future Research Directions
  5.1 Conclusions
  5.2 Future Research Directions
  5.3 Research Experience and Remarks
References
Appendix A Sample Presidential Office Personnel Gazette
Appendix B Experimental Results by Extraction Target
Appendix C Experimental Results by Extraction Field
Appendix D Sample CNA (Central News Agency) Domestic Political News
List of Figures
  Figure 2-1 Output of a Question Answering System
  Figure 2-2 Example of a WIEN Rule
  Figure 2-3 Example of a SoftMealy Rule
  Figure 2-4 Example of the EC Formalism
  Figure 2-5 Heuristics Used by AutoSlog
  Figure 2-6 Sentence Form after FASTUS First-Stage Processing
  Figure 2-7 Sample Presidential Office Gazette
  Figure 3-1 Algorithm for Converting an Extraction Pattern into a Finite State Automaton
  Figure 3-2 Graphical Representation of the Extraction Pattern "TNA"
  Figure 3-3 State Transitions and Extraction for the Extraction Pattern "TNA"
  Figure 3-4 System Execution Architecture
  Figure 3-5 Finite State Automaton for the Outer Level
  Figure 3-6 Finite State Automaton for the Inner Level
  Figure 3-7 Finite State Automaton for the Inner Level
  Figure 3-8 Current State Extracts but Next State Does Not, Case 1
  Figure 3-9 Current State Extracts but Next State Does Not, Case 2
  Figure 3-10 Finite State Automaton for the Extraction Pattern OTF
  Figure 3-11 Finite State Automaton for the Extraction Pattern OT
  Figure 3-12 Both Current State and Next State Extract, Case 1
  Figure 3-13 Both Current State and Next State Extract, Case 2
  Figure 3-14 Finite State Automaton for the Extraction Pattern NBOT
  Figure 3-15 Finite State Automaton for the Extraction Pattern NBT
  Figure 3-16 Finite State Automaton for the Extraction Pattern "ANBOT"
  Figure 3-17 Finite State Automaton for the Extraction Pattern "OT"
  Figure 3-18 System Architecture
  Figure 4-1 Sample Presidential Office Personnel Gazette
  Figure 4-2 Results by Extraction Target, Regular Block
  Figure 4-3 Results by Extraction Target, Random Block
  Figure 4-4 Results for the Person Name (N) Field, Regular Block
  Figure 4-5 Results for the Person Name (N) Field, Random Block
  Figure 4-6 Results for the Organization (O) Field, Regular Block
  Figure 4-7 Results for the Organization (O) Field, Random Block
  Figure 4-8 Results for the Rank (R) Field, Regular Block
  Figure 4-9 Results for the Rank (R) Field, Random Block
  Figure 4-10 Results for the Job Title (T) Field, Regular Block
  Figure 4-11 Results for the Job Title (T) Field, Random Block
  Figure 4-12 Sample Document from the Tan-Hsin Archives (淡新檔案)
  Figure 4-13 Extraction Results with "Domestic Politics" News as Test Documents
List of Tables
  Table 2-1 Comparison of Information Extraction and Text Understanding
  Table 2-2 Comparison of the Knowledge Engineering Approach and the Automatically Trainable Approach
  Table 2-3 Advantages and Disadvantages of the Knowledge Engineering Approach and the Automatically Trainable Approach
  Table 3-1 Transition Table for the Extraction Pattern "TNA"
  Table 3-2 Sample Extraction Template
  Table 3-3 Multi-Level Extraction Patterns and the Attributes of Their Semantic Elements
  Table 3-4 Semantic Elements and Their Corresponding Keywords
  Table 4-1 Extraction Template for the Presidential Office Personnel Gazette
  Table 4-2 Selected Multi-Level Extraction Patterns for the Presidential Office Personnel Gazette and the Attributes of Their Semantic Elements
  Table 4-3 Semantic Elements and Their Corresponding Keywords
  Table 4-4 Experimental Results by Extraction Target
  Table 4-5 Experimental Results for the Person Name (N) Field
  Table 4-6 Experimental Results for the Organization (O) Field
  Table 4-7 Experimental Results for the Rank (R) Field
  Table 4-8 Experimental Results for the Job Title (T) Field
  Table 4-9 Summary of Experimental Results by Extraction Target
  Table 4-10 Comparison of the Pattern-Matching and Finite-State-Automaton Information Extraction System with the Presidential Office Personnel Gazette Information Extraction System | zh_TW |
dc.format.extent | 17863 bytes | - |
dc.format.extent | 20566 bytes | - |
dc.format.extent | 54666 bytes | - |
dc.format.extent | 45909 bytes | - |
dc.format.extent | 376210 bytes | - |
dc.format.extent | 435721 bytes | - |
dc.format.extent | 974360 bytes | - |
dc.format.extent | 37060 bytes | - |
dc.format.extent | 39459 bytes | - |
dc.format.extent | 2980002 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en_US | - |
dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0090753018 | en_US |
dc.subject (關鍵詞) | 資訊擷取 | zh_TW |
dc.subject (關鍵詞) | 型態辨識 | zh_TW |
dc.subject (關鍵詞) | 有限狀態自動機 | zh_TW |
dc.subject (關鍵詞) | Information Extraction | en_US |
dc.subject (關鍵詞) | Pattern based | en_US |
dc.subject (關鍵詞) | Finite State Automata | en_US |
dc.title (題名) | 以型態辨識為主的中文資訊擷取技術研究 | zh_TW |
dc.type (資料類型) | thesis | en |
dc.relation.reference (參考文獻) | [1] Wilks, Y. and Catizone, R. 1999. Can We Make Information Extraction More Adaptive? In M. Pazienza (ed.) Proceedings of the Summer School on Information Extraction (SCIE-99) Workshop, Springer-Verlag, Berlin. Rome. | zh_TW |
dc.relation.reference (參考文獻) | [2] Appelt, D. E. and Israel, D. J. 1999. Introduction to Information Extraction Technology. In Proceedings of the 16th International Joint Conference on Artificial Intelligence. | zh_TW |
dc.relation.reference (參考文獻) | [3] Jim Cowie, Wendy Lehnert. 1996. Information Extraction. Communications of the ACM (CACM), 39 (1), pp. 80-91. | zh_TW |
dc.relation.reference (參考文獻) | [4] Chowdhury, G. G. 1999. Introduction to Modern Information Retrieval. London: Library Association Publishing. | zh_TW |
dc.relation.reference (參考文獻) | [5] Rohini Srihari and Wei Li. A Question Answering System Supported by Information Extraction. Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), 166-172. | zh_TW |
dc.relation.reference (參考文獻) | [6] Grishman, Ralph and Beth M. Sundheim. 1996. Message Understanding Conference-6: A Brief History. In Proceedings of the 16th International Conference on Computational Linguistics (COLING 96), Copenhagen, Denmark. | zh_TW |
dc.relation.reference (參考文獻) | [7] Peng, F. Models Development in IE Tasks – A survey. 1999. CS685 (Intelligent Computer Interface) course project, Computer Science Department, University of Waterloo. | zh_TW |
dc.relation.reference (參考文獻) | [8] Ellen Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811-816. | zh_TW |
dc.relation.reference (參考文獻) | [9] Ellen Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1044-1049. | zh_TW |
dc.relation.reference (參考文獻) | [10] Califf, M. E. and Mooney, R. J. 1999. Relational Learning of Pattern-match Rules for Information Extraction. In Proceedings of the 16th National Conference on AI, pp. 328-334. | zh_TW |
dc.relation.reference (參考文獻) | [11] Kushmerick, N., Weld, D., and Doorenbos, R. 1997. Wrapper Induction for Information Extraction. In Proceedings of the 15th International Joint Conference on AI (IJCAI-97), pp. 729-737. | zh_TW |
dc.relation.reference (參考文獻) | [12] Kushmerick, N. 1998. Wrapper Induction: Efficiency and Expressiveness. In Proceedings of AAAI-98 Workshop on Artificial Intelligence and Information Integration, pp. 15-68, AAAI Press, Menlo Park, California. | zh_TW |
dc.relation.reference (參考文獻) | [13] Chun-Nan Hsu and Ming-Tzung Dung. Aug 1998. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Journal of Information Systems, Special Issue on Semi-structured Data, Vol. 23, No. 8, pp. 521-538. | zh_TW |
dc.relation.reference (參考文獻) | [14] Chun-Nan Hsu and Chien-Chi Chang. 1999. Finite-state Transducers for Semi-structured Text Mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden. | zh_TW |
dc.relation.reference (參考文獻) | [15] Muslea, I., Minton, S., and Knoblock, C. 1998. STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources. In Proceedings of AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01, AAAI Press, Menlo Park, California. | zh_TW |
dc.relation.reference (參考文獻) | [16] Muslea, I., Minton, S., and Knoblock, C. 1999. A Hierarchical Approach to Wrapper Induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents-99), pp. 190-197, Seattle, Washington. | zh_TW |
dc.relation.reference (參考文獻) | [17] Chia-Hui Chang and Chun-Nan Hsu. Dec 1999. Automatic Extraction of Information Blocks Using PAT Trees. In Proceedings of 1999 National Computer Symposium (NCS-1999), Tamkang University, Tamsui, Taiwan. | zh_TW |
dc.relation.reference (參考文獻) | [18] Appelt, D., Hobbs, J., Israel, D., Kameyama, M., and Tyson, M. 1993. The SRI MUC-5 JV FASTUS Information Extraction System. In Proceedings of the Fifth Message Understanding Conference. | zh_TW |
dc.relation.reference (參考文獻) | [19] Jyi-Shane Liu and Mu-Hsi Tseng. November 2001. Extracting Government Personnel Information from Official Gazettes. In Proceedings of the Sixth Conference on Artificial Intelligence and Applications, pp. 593-598, Kaohsiung, Taiwan. | zh_TW |
dc.relation.reference (參考文獻) | [20] Chia-Hui Chang, Shao-Chen Lui, and Yen-Chin Wu. Apr 2001. Applying Pattern Mining to Web Information Extraction. In Proceedings of the 5th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2001), pp. 4-16, Hong Kong. | zh_TW |
dc.relation.reference (參考文獻) | [21] Chia-Hui Chang and Shao-Chen Lui. May 2001. IEPAD: Information Extraction Based on Pattern Discovery. In Proceedings of the 10th International Conference on World Wide Web (WWW10), pp. 595-609, Hong Kong. | zh_TW |
dc.relation.reference (參考文獻) | [22] Horowitz, E., Sahni, S., and Rajasekaran, S. Computer Algorithms/C++, pp. 284-286. | zh_TW |
dc.relation.reference (參考文獻) | [23] Forrester Research, URL: http://www.forrester.com | zh_TW |
dc.relation.reference (參考文獻) | [24] Message Understanding Conferences, URL: http://www.muc.saic.com | zh_TW |
dc.relation.reference (參考文獻) | [25] Text Retrieval Conferences, URL: http://trec.nist.gov | zh_TW |
dc.relation.reference (參考文獻) | [26] QA Track Specifications, URL: http://www.research.att.com/~singhal/qa-track-sepc.txt | zh_TW |
dc.relation.reference (參考文獻) | [27] Presidential Office personnel appointment and dismissal gazette (總統府人事任免公報), URL: www.president.gov.tw/2_report/layer2.html | zh_TW |
dc.relation.reference (參考文獻) | [28] Tan-Hsin Archives (淡新檔案), URL: http://www.lib.ntu.edu.tw/specialcollect/Coll_Taiwan/Coll_Tan-hsin.htm | zh_TW |
dc.relation.reference (參考文獻) | [29] CNA (Central News Agency, 中央社) full-text news search, URL: http://search.cnanews.gov.tw | zh_TW |
dc.relation.reference (參考文獻) | [30] L.F. Chien. 1997. PAT Tree Based Keyword Extraction for Chinese Information Retrieval, Proceedings of the ACM SIGIR International Conference on Information Retrieval. | zh_TW |