dc.contributor.advisor | 劉吉軒 | zh_TW |
dc.contributor.advisor | Liu, Jyi-Shane | en_US |
dc.contributor.author (Authors) | 鄭雍瑋 | zh_TW |
dc.contributor.author (Authors) | Cheng, Yung-Wei | en_US |
dc.creator (作者) | 鄭雍瑋 | zh_TW |
dc.creator (作者) | Cheng, Yung-Wei | en_US |
dc.date (日期) | 2005 | en_US |
dc.date.accessioned | 17-Sep-2009 13:56:10 (UTC+8) | - |
dc.date.available | 17-Sep-2009 13:56:10 (UTC+8) | - |
dc.date.issued (上傳時間) | 17-Sep-2009 13:56:10 (UTC+8) | - |
dc.identifier (Other Identifiers) | G0093753006 | en_US |
dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/32649 | - |
dc.description (描述) | Master's degree | zh_TW |
dc.description (描述) | National Chengchi University | zh_TW |
dc.description (描述) | Department of Computer Science | zh_TW |
dc.description (描述) | 93753006 | zh_TW |
dc.description (描述) | 94 | zh_TW |
dc.description.abstract (摘要) | 資訊擷取是從自然語言文本中辨識出特定的主題或事件的描述,進而萃取出相關主題或事件元素中的對應資訊,再將其擷取之結果彙整至資料庫中,便能將自然語言文件轉換成結構化的核心資訊。然而資訊擷取技術的結果會有錯誤情況發生,若單只依靠人工檢查及更正錯誤的方式進行,將會是耗費大量人力及時間的工作。在本研究論文中,我們提出字串圖形結構與字串特徵值兩種錯誤資料偵測方法。前者是透過圖形結構比對各資料內字元及字元間關聯,接著由公式計算出每筆資料的比對分數,藉由分數高低可判斷是否為錯誤資料;後者則是利用字串特徵值,來描述字串外表特徵,再透過SVM和C4.5機器學習分類方法歸納出決策樹,進而分類正確與錯誤二元資料。而此兩種偵測方法的差異在於前者隱含了圖學理論之節點位置與鄰點概念,直接比對原始字串內容;後者則是將原始字串轉換成特徵數值,進行分類等動作。在實驗方面,我們以「總統府人事任免公報」之資訊擷取成果資料庫作為測試資料。實驗結果顯示,本研究所提出的錯誤偵測方法可以有效偵測出不合格的值組,不但能節省驗證資料所花費的成本,甚至可確保高資料品質的資訊擷取成果產出,促使資訊擷取技術更廣泛的實際應用。 | zh_TW |
dc.description.abstract (摘要) | Given a target subject and a text collection, information extraction (IE) techniques can populate a database in which each record is an instance of the subject documented in the collection. Even with state-of-the-art IE techniques, however, the results are expected to contain errors, and detecting and correcting them manually is labor-intensive and time-consuming. This validation cost remains a major obstacle to deploying practical IE applications that have high validity requirements. In this thesis, we propose two error detection methods: a string graph structure method and a string feature-based method. The former uses a graph structure to compare the characters of each data value and the relations between those characters, computes a matching score for each value with a formula, and uses that score to judge whether the value is erroneous. The latter uses string features to describe the surface characteristics of each string, then trains classifiers with the C4.5 decision-tree and SVM machine learning algorithms to classify each value as valid or invalid. Both methods can characterize the data and verify its correctness. They differ in that the former compares the raw string content directly and draws on the graph-theoretic concepts of node position and neighboring nodes, whereas the latter first converts the raw string into feature values and then classifies them. In our experiments, we use a database of IE results extracted from government personnel directives as test data. The results show that the proposed methods detect invalid values effectively. Our work reduces validation cost and improves the quality of IE results, and it provides both analytical and empirical evidence that the usability of IE results can be effectively enhanced. | en_US |
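dc.description.tableofcontents | Chapter 1 Introduction: 1.1 Research Background; 1.2 Research Motivation and Objectives; 1.3 Research Method; 1.4 Thesis Organization | zh_TW |
dc.description.tableofcontents | Chapter 2 Literature Review: 2.1 Information Extraction; 2.1.1 Definition of Information Extraction; 2.1.2 Information Extraction Methods; 2.1.3 Information Extraction Techniques; 2.2 Data Quality; 2.2.1 Definition of Data Quality; 2.2.2 Dimensions of Data Quality; 2.3 Data Cleaning; 2.3.1 Definition of Data Cleaning; 2.3.2 Related Data Cleaning Techniques; 2.4 Graph Structure; 2.5 Classification Analysis; 2.5.1 ID3 Decision Tree Induction; 2.5.2 C4.5 Decision Tree Induction; 2.5.3 Support Vector Machine; 2.5.4 Multi-Expert Classifier; 2.6 Summary | zh_TW |
dc.description.tableofcontents | Chapter 3 Erroneous Data Detection Methods: 3.1 Requirements Analysis; 3.1.1 Error Analysis; 3.1.2 Anomalies in Information Extraction Results; 3.1.3 Data Detection Model; 3.2 String Graph Structure Detection Method; 3.2.1 String Graph Structure Builder; 3.2.2 Rule Model for Detecting Erroneous Data; 3.2.3 Data Inference Engine; 3.3 String Feature Value Detection Method; 3.3.1 String Feature Extractor; 3.3.2 String Feature Converter; 3.3.3 C4.5 Algorithm; 3.3.4 SVM Algorithm; 3.3.5 Data Inference Engine | zh_TW |
dc.description.tableofcontents | Chapter 4 Experimental Analysis, Discussion, and Application of the Methods: 4.1 Experimental Test Data; 4.2 Experimental Evaluation Method; 4.3 Analysis of Related Data Cleaning Techniques; 4.4 Experimental Design and Discussion of Results; 4.4.1 Experimental Framework for the String Graph Method; 4.4.1.1 Results for Graph Structure Node Matching Schemes; 4.4.1.2 Results for Number of Training Records; 4.4.1.3 Results for Number of Years of Target Data; 4.4.1.4 Results for String Graph Method Parameters; 4.4.1.5 Summary of String Graph Structure Experiments; 4.4.2 Experimental Framework for the String Feature Value Method; 4.4.2.1 Results for the String Feature Converter; 4.4.2.2 Results for Number of Years of Training and Target Data; 4.4.2.3 Results for String Feature Method Parameters; 4.4.2.4 Results for Number of Training Records; 4.4.2.5 Summary of String Feature Value Experiments; 4.4.3 Discussion of Overall Results; 4.5 Application of the Error Detection Methods; 4.6 Summary | zh_TW |
dc.description.tableofcontents | Chapter 5 Conclusions and Future Research Directions: 5.1 Conclusions; 5.2 Future Research Directions | zh_TW |
dc.description.tableofcontents | References; Appendix A; Appendix B | zh_TW |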
dc.format.extent | 48326 bytes | - |
dc.format.extent | 67750 bytes | - |
dc.format.extent | 94490 bytes | - |
dc.format.extent | 262748 bytes | - |
dc.format.extent | 111314 bytes | - |
dc.format.extent | 187085 bytes | - |
dc.format.extent | 254742 bytes | - |
dc.format.extent | 537057 bytes | - |
dc.format.extent | 133035 bytes | - |
dc.format.extent | 83723 bytes | - |
dc.format.extent | 345366 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en_US | - |
dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0093753006 | en_US |
dc.subject (關鍵詞) | 錯誤偵測 | zh_TW |
dc.subject (關鍵詞) | 資訊擷取 | zh_TW |
dc.subject (關鍵詞) | 文本資料描述 | zh_TW |
dc.subject (關鍵詞) | Error Detection | en_US |
dc.subject (關鍵詞) | Information Extraction | en_US |
dc.subject (關鍵詞) | Textual Data Profiling | en_US |
dc.title (題名) | 中文資訊擷取結果之錯誤偵測 | zh_TW |
dc.title (題名) | Error Detection on Chinese Information Extraction Results | en_US |
dc.type (資料類型) | thesis | en |
dc.relation.reference (參考文獻) | [1] Paulson, L. D., “Data Quality: a Rising e-Business Concern,” IT Professional, Vol. 2, No. 4, July-Aug. 2000, pp.10–14. | zh_TW |
dc.relation.reference (參考文獻) | [2] Rahm, E. and Do, H.-H., “Data Cleaning: Problems and Current Approaches,” IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4, December 2000. | zh_TW |
dc.relation.reference (參考文獻) | [3] 翁家緯,“以型態辨識為主的中文資訊擷取技術研究”,國立政治大學資訊科學系碩士論文,2003。 | zh_TW |
dc.relation.reference (參考文獻) | [4] Message Understanding Conference, URL: http://www.muc.saic.com | zh_TW |
dc.relation.reference (參考文獻) | [5] Text Retrieval Conference, URL: http://trec.nist.gov | zh_TW |
dc.relation.reference (參考文獻) | [6] Jim Cowie and Wendy Lehnert. 1996. Information Extraction. Communications of the ACM (CACM), 39(1), pp. 80-91. | zh_TW |
dc.relation.reference (參考文獻) | [7] Appelt, D. E. and Israel, D. J. 1999. Introduction to Information Extraction Technology. In Proceedings of the 16th International Joint Conference on Artificial Intelligence. | zh_TW |
dc.relation.reference (參考文獻) | [8] Peng, F. 1999. Models Development in IE Tasks – A Survey. CS685 (Intelligent Computer Interface) course project, Computer Science Department, University of Waterloo. | zh_TW |
dc.relation.reference (參考文獻) | [9] Ellen Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811-816. | zh_TW |
dc.relation.reference (參考文獻) | [10] Ellen Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1044-1049. | zh_TW |
dc.relation.reference (參考文獻) | [11] Califf, M. E. and Mooney, R. J. 1999. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the 16th National Conference on AI, pp. 328-334. | zh_TW |
dc.relation.reference (參考文獻) | [12] Kushmerick, N., Weld, D., and Doorenbos, R. 1997. Wrapper Induction for Information Extraction. In Proceedings of the 15th International Joint Conference on AI (IJCAI-97), pp. 729-737. | zh_TW |
dc.relation.reference (參考文獻) | [13] Kushmerick, N. 1998. Wrapper Induction: Efficiency and Expressiveness. In Proceedings of the AAAI-98 Workshop on Artificial Intelligence and Information Integration, pp. 15-68, AAAI Press, Menlo Park, California. | zh_TW |
dc.relation.reference (參考文獻) | [14] Chun-Nan Hsu and Ming-Tzung Dung. Aug 1998. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Journal of Information Systems, Special Issue on Semi-structured Data, Vol. 23, No. 8, pp. 521-538. | zh_TW |
dc.relation.reference (參考文獻) | [15] Chun-Nan Hsu and Chien-Chi Chang. 1999. Finite-state Transducers for Semi-structured Text Mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden. | zh_TW |
dc.relation.reference (參考文獻) | [16] Jyi-Shane Liu and Mu-Hsi Tseng. November 2001. Extracting Government Personnel Information from Official Gazettes. In Proceedings of the Sixth Conference on Artificial Intelligence and Applications, pp. 593-598, Kaohsiung, Taiwan. | zh_TW |
dc.relation.reference (參考文獻) | [17] Oman, R. C. and Ayers, T. B. “Improving Data Quality,” Journal of Systems Management, May 1988, pp.31-35. | zh_TW |
dc.relation.reference (參考文獻) | [18] Tayi, G. K. and Ballou, D. P. “Examining Data Quality,” Communications of the ACM (41:2), Feb. 1998, pp.54-57. | zh_TW |
dc.relation.reference (參考文獻) | [19] Ballou, D. P. and Pazer, H. L. “Implication of Data Quality for Spreadsheet Analysis,” Data Base, Spr. 1987, pp.13-19. | zh_TW |
dc.relation.reference (參考文獻) | [20] Redman, T.C. Data Quality for the Information Age, Artech House, Inc., 1996. Redman, T.C. “The Impact of Poor Data Quality on the Typical Enterprise,” Communications of the ACM (41:2), Feb. 1998, pp.79-82. | zh_TW |
dc.relation.reference (參考文獻) | [21] Brauer, B., “Data Quality – Spinning Straw Into Gold,” Available [Online] at: http://www2.sas.com/proceedings/sugi26/p117-26.pdf, 2000. | zh_TW |
dc.relation.reference (參考文獻) | [22] Muller, H., and Freytag, J. C. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University Berlin, 2003. | zh_TW |
dc.relation.reference (參考文獻) | [23] V. Raman and J. M. Hellerstein, An Interactive Framework for Data Cleaning, UC Berkeley Computer Science Division Report No. UCB/CSD00/1110, September 2000. | zh_TW |
dc.relation.reference (參考文獻) | [24] H. Galhardas, D. Florescu and D. Shasha, An Extensible Framework for Data Cleaning, INRIA Technical Report, 1999. | zh_TW |
dc.relation.reference (參考文獻) | [25] Kaufman, L. and Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, 1990. | zh_TW |
dc.relation.reference (參考文獻) | [26] 李念秋,“資料品質改善之研究:錯誤資料偵測技術之發展與評估”,國立中山大學資訊管理系碩士論文,2002。 | zh_TW |
dc.relation.reference (參考文獻) | [27] Quinlan, J. R., “Induction of Decision Trees,” Machine Learning, Vol. 1, 1986, pp.81-106. | zh_TW |
dc.relation.reference (參考文獻) | [28] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993. | zh_TW |
dc.relation.reference (參考文獻) | [29] Chan, P. K., Fan, W., Prodromidis, A. L., and Stolfo, S. J., “Distributed Data Mining in Credit Card Fraud Detection,” IEEE Intelligent Systems, Vol. 14, No. 6, 1999, pp.67-74. | zh_TW |
dc.relation.reference (參考文獻) | [30] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. | zh_TW |
dc.relation.reference (參考文獻) | [31] V. Vapnik. Statistical Learning Theory. Wiley, 1998. | zh_TW |
dc.relation.reference (參考文獻) | [32] Elmasri, R., and Navathe, S., Fundamentals of Database Systems, 3rd edition, 2000. | zh_TW |
dc.relation.reference (參考文獻) | [33] LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html, URL:http://www.csie.ntu.edu.tw/~r91034/svm/svm_tutorial.html | zh_TW |
dc.relation.reference (參考文獻) | [34] Redman, T., Data Quality for the Information Age, Artech House, Boston, 1996. | zh_TW |
dc.relation.reference (參考文獻) | [35] 總統府人事任免公報,URL:http://www.president.gov.tw/2_report/layer2.html | zh_TW |
dc.relation.reference (參考文獻) | [36] Maletic, J.I. and Marcus, A., Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Boston, October 2000. | zh_TW |
dc.relation.reference (參考文獻) | [37] 立法院新聞知識管理系統,URL: http://nplnews.ly.gov.tw/index.jsp | zh_TW |