An Evaluation of the Effectiveness of the Data Purification Process in Data Mining


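Title	An Evaluation of the Effectiveness of the Data Purification Process in Data Mining (資料採礦中的資料純化過程之效果評估)
Author	楊惠如
Contributors	鄭宇庭; 謝邦昌; 楊惠如
Keywords	Data purification (Data Systematic Purifying Analysis); data mining; missing data; rare data; imputation; functional mapping; database value-added
Date	2004
Upload date	2009-09-14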
Abstract	In recent years, financial holding companies have sprung up across Taiwan. Each holding company comprises many finance-related subsidiaries, such as property insurers, banks, securities firms, and life insurers, so the databases once kept separately by each subsidiary can now be integrated. When senior managers need to make decisions, they can mine the integrated database for useful information. However powerful data mining may be, it depends on a good, complete database: if data quality is poor, or the data are unrelated to the research objective, the mining effort cannot succeed. Sample surveys and functional mapping make it possible to add value to a database. Given a target database and an auxiliary database, functional mapping integrates them into one large database, and the missing or rare values in the combined database are then imputed to yield a value-added database. This study names the whole process "Data SPA" (Data Systematic Purifying Analysis), or "data purification". The main task of the study is to verify the structure of the purified data, that is, to confirm that the data remain useful and correct after the process. Three methods are used to examine the data: horizontal evaluation, vertical evaluation, and overall evaluation. The evaluations show that when the purified data are examined variable by variable or observation by observation, their performance is unsatisfactory, but the data as a whole hold up quite well. Although the horizontal and vertical evaluations show that the purified data cannot fully match the original complete data, purification fills in missing values and adds fields, which increases the information content of the data. Data purification therefore does have an effect, because the added information is a substantial benefit to a database that is to be mined.
Description	Master's thesis; National Chengchi University (國立政治大學); Graduate Institute of Statistics; 92354003; 93
Source	http://thesis.lib.nccu.edu.tw/record/#G0923540031
Type	thesis
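
The Data SPA process summarized in the abstract (map a new field from an auxiliary database onto the target database, impute what is still missing, then compare against complete data) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's implementation: the least-squares line stands in for whatever functional mapping the thesis used, mean imputation stands in for its imputation method, the synthetic data and every function name are hypothetical, and the mean absolute difference over twenty runs merely echoes the measure named in the thesis's table titles.

```python
import random
import statistics


def fit_linear_map(xs, ys):
    """Fit y = a + b*x by least squares (an assumed stand-in for functional mapping)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def mean_abs_diff(estimated, truth):
    """Mean absolute difference between estimated and true values."""
    return statistics.fmean([abs(e - t) for e, t in zip(estimated, truth)])


def run_experiment(seed, n=200, missing_rate=0.2):
    """One simulated purification run; returns the error on the mapped field."""
    rng = random.Random(seed)
    # Auxiliary database: holds the shared variable x and the extra variable y.
    aux_x = [rng.uniform(0, 10) for _ in range(n)]
    aux_y = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in aux_x]
    # Target database: holds x with gaps; the true y is kept only for evaluation.
    true_x = [rng.uniform(0, 10) for _ in range(n)]
    true_y = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in true_x]
    observed_x = [None if rng.random() < missing_rate else x for x in true_x]
    # Step 1: learn the mapping on the auxiliary data and add field y to the target.
    mapping = fit_linear_map(aux_x, aux_y)
    mapped_y = [None if x is None else mapping(x) for x in observed_x]
    # Step 2: impute the values that are still missing in the added field.
    purified_y = mean_impute(mapped_y)
    # Step 3: compare the purified field with the complete (true) data.
    return mean_abs_diff(purified_y, true_y)


# Repeat the experiment twenty times and summarize the error.
results = [run_experiment(seed) for seed in range(20)]
print(f"mean MAD = {statistics.fmean(results):.3f}, "
      f"sd = {statistics.stdev(results):.3f}")
```

In this toy setting the rows that were mean-imputed contribute most of the error, which is one plausible way row-level agreement can look weak even when the aggregate structure of the data is preserved.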

dc.contributor.advisor	鄭宇庭; 謝邦昌	zh_TW
dc.contributor.author (Authors)	楊惠如	zh_TW
dc.creator (Author)	楊惠如	zh_TW
dc.date (Date)	2004	en_US
dc.date.accessioned	2009-09-14	-
dc.date.available	2009-09-14	-
dc.date.issued (Upload date)	2009-09-14	-
dc.identifier (Other Identifiers)	G0923540031	en_US
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/30937	-
dc.description (Description)	Master's thesis	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Graduate Institute of Statistics	zh_TW
dc.description (Description)	92354003	zh_TW
dc.description (Description)	93	zh_TW
dc.description.tableofcontents	zh_TW
  Chapter 1 Introduction
    1.1 Research Background
    1.2 Research Motivation
    1.3 Research Objectives
    1.4 Research Process
    1.5 Thesis Structure
  Chapter 2 Literature Review
    2.1 Introduction to Database Systems
    2.2 Overview of Data Warehousing
    2.3 Overview of Data Mining
      2.3.1 Definition and Development of Data Mining
      2.3.2 The Relationship between KDD and Data Mining
      2.3.3 Functions of Data Mining
    2.4 Database Value-Adding
      2.4.1 The Meaning of Database Value-Adding
      2.4.2 Concepts and Methods of Functional Mapping
      2.4.3 Concepts and Methods of Data Imputation
    2.5 Data Mining Algorithms
      2.5.1 Regression Methods
      2.5.2 Decision Trees
      2.5.3 Neural Networks
  Chapter 3 Research Methods
    3.1 The Concept of Data Purification
    3.2 Research Limitations
    3.3 Research Approach
    3.4 Research Framework
  Chapter 4 Empirical Analysis
    4.1 Introduction to the Industry, Commerce and Service Census Database
    4.2 Data Processing and Preparation
    4.3 The Data Purification Process
    4.4 Effect Evaluation
      4.4.1 Horizontal Evaluation
      4.4.2 Vertical Evaluation
      4.4.3 Overall Evaluation
  Chapter 5 Conclusions and Recommendations
    5.1 Conclusions
    5.2 Future Improvements and Research Directions
  References
  Appendices 1 to 5
  List of Figures
    Figure 1-1 Sample survey process
    Figure 1-2 Functional mapping
    Figure 1-3 Database value-adding process
    Figure 1-4 Research flowchart
    Figure 2-1 Databases and database systems
    Figure 2-2 Relationship between data warehousing and data mining
    Figure 2-3 Knowledge discovery process
    Figure 2-4 Functions of data mining
    Figure 2-5 Functional mapping framework
    Figure 2-6 Regression analysis flowchart
    Figure 2-7 Logistic curve
    Figure 2-8 CART partitioning
    Figure 2-9 Artificial neuron
    Figure 2-10 Neural network
    Figure 3-1 Experimental procedure
    Figure 3-2 Research framework, Procedure 1
    Figure 3-3 Research framework, Procedure 2
    Figure 4-1 Distribution of salaries and benefit allowances
    Figure 4-2 Distribution of engagement in triangular trade
    Figure 4-3 Distribution of organization type
    Figure 4-4 Distribution of main mode of operation
    Figure 4-5 Distribution of number of employees
    Figure 4-6 Distribution of land area
    Figure 4-7 Distribution of building floor area
    Figure 4-8 Distribution of product sales revenue
    Figure 4-9 Distribution of domestic product sales revenue
    Figure 4-10 Distribution of export product sales revenue
    Figure 4-11 Distribution of repair revenue
    Figure 4-12 Distribution of processing-fee revenue
    Figure 4-13 Distribution of other non-operating revenue
    Figure 4-14 Procedure 1: distribution of average scores in the horizontal evaluation
    Figure 4-15 Procedure 1: average estimation accuracy per variable in the horizontal evaluation
    Figure 4-16 Procedure 2: distribution of average scores in the horizontal evaluation
    Figure 4-17 Procedure 2: average estimation accuracy per variable in the horizontal evaluation
    Figure 4-18 Procedure 1: distribution of the number of shared important variables in the neural network models
    Figure 4-19 Procedure 2: distribution of the number of shared important variables in the neural network models
  List of Tables
    Table 2-1 History of data mining
    Table 4-1 Descriptive statistics of the variables produced by mapping
    Table 4-2 Procedure 1: distribution of average scores in the horizontal evaluation
    Table 4-3 Procedure 1: average estimation accuracy per variable in the horizontal evaluation
    Table 4-4 Procedure 2: distribution of average scores in the horizontal evaluation
    Table 4-5 Procedure 2: average estimation accuracy per variable in the horizontal evaluation
    Table 4-6 Procedure 1: mean and standard deviation of each variable's mean absolute difference over twenty experiments
    Table 4-7 Procedure 2: mean and standard deviation of each variable's mean absolute difference over twenty experiments
    Table 4-8 Procedure 1: distribution of the number of shared important variables in the neural network models
    Table 4-9 Repetition rate of each variable in the neural network models
    Table 4-10 Variables and split points used by the C5.0 model on the original data
    Table 4-11 Procedure 1: variables and split points used by the C5.0 model
    Table 4-12 Procedure 2: distribution of the number of shared important variables in the neural network models
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0923540031	en_US
dc.subject (Keywords)	Data purification (資料純化)	zh_TW
dc.subject (Keywords)	Data mining (資料採礦)	zh_TW
dc.subject (Keywords)	Missing values (遺漏值)	zh_TW
dc.subject (Keywords)	Imputation (插補)	zh_TW
dc.subject (Keywords)	Functional mapping (函數映射)	zh_TW
dc.subject (Keywords)	Database value-adding (資料庫加值)	zh_TW
dc.subject (Keywords)	Data Systematic Purifying Analysis	en_US
dc.subject (Keywords)	Data Mining	en_US
dc.subject (Keywords)	Missing Data	en_US
dc.subject (Keywords)	Rare Data	en_US
dc.subject (Keywords)	Imputation	en_US
dc.subject (Keywords)	Functional Mapping	en_US
dc.subject (Keywords)	Database Value-Added	en_US
dc.title (Title)	An Evaluation of the Effectiveness of the Data Purification Process in Data Mining (資料採礦中的資料純化過程之效果評估)	zh_TW
dc.type (Type)	thesis	en
