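Title: Model Selection in Data Mining (資料採礦中之模型選取)
Author: 孫莓婷
Contributors: 鄭宇庭 (advisor); 謝邦昌 (advisor); 孫莓婷
Keywords: Data Mining (資料採礦); Imputation Method (插補方法); Sampling Method (抽樣方法); Model Selection (模型選取)
Date: 2003
Upload date: 2009-09-14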
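Abstract: With the help of computers, enterprises and organizations store ever more data internally, and the volume keeps growing faster. A large amount of data, however, does not necessarily bring a large amount of knowledge: even with a powerful database system, a database is only storage space unless its data are analysed and interpreted in a meaningful way. In the past, enterprises and organizations treated the database merely as a query system and did not realize that valuable information could be obtained from it; whether the database content is complete matters even more. Because an enterprise's database is not necessarily sound, it may be huge and still lack sufficient information. We argue that database value-adding methods (imputation, sampling, and model evaluation), applied to expand the available information, should be able to increase the information in a database without changing the structure of the original data.
      This study mainly compares whether data at different stages remain consistent with the structure of the original data after value-adding. The research framework is divided into three main workflows: the regression model, the logistic regression model, and the C5.0 decision tree. After value-adding at the different stages, we conclude that when the regression model is the main workflow, regression-based imputation brings the value-added database closer to the original data, and if sampling is then used to reduce the data volume, systematic sampling gives better results than simple random sampling. When the C5.0 decision tree is the main workflow, using a neural network algorithm as the main imputation method increases the amount of information and also brings the imputed data closer to the original data. For the logistic regression model, the class proportions of the discrete variables differ too greatly, so this workflow could not reach a valid conclusion.
      The empirical analysis shows that the best-performing database value-adding technique varies with the modelling approach. Compared with the unimputed database, however, value-adding techniques do increase the amount of information and bring the value-added virtual database closer to the structure of the original data.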
With the rapid advancement of computer technology, computers can store huge amounts of data. An abundance of data, without proper treatment, does not necessarily mean having valuable information on hand. Without such treatment, a large database system merely serves as a means of storing and accessing data. With this in mind, we focus on the integrity of the database. We adopt methods in which missing values are imputed and added while the data structure is left unmodified.

      This paper asks which of three methods, namely regression analysis, logistic regression analysis and the C5.0 decision tree, yields a value-added database that is most consistent with, and most closely resembles, the original one. The results are as follows. With the regression method, the database after imputation is closer in structure to the original one. When the data set is large and a smaller one is desired, systematic sampling gives a better outcome than simple random sampling.
      The C5.0 decision tree method gives results similar to those of the regression method. Finally, for logistic regression analysis, the class proportions of the discrete variables are highly unbalanced, which makes it difficult to draw a reasonable conclusion.

      Although the three methods give slightly different outcomes, one finding stands out: the value-added database technique can actually improve the authenticity of the database with respect to the original one.
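
The imputation and sampling steps named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation and uses none of its data: the function names (`regression_impute`, `simple_random_sample`, `systematic_sample`) are hypothetical, and the imputation assumes a single predictor fitted by ordinary least squares, which is only one simple form of regression-based imputation.

```python
import random
from statistics import mean


def regression_impute(xs, ys):
    """Fill None entries of ys using a least-squares line fitted on the complete (x, y) pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Observed values are kept; only missing ones are replaced by predictions.
    return [y if y is not None else intercept + slope * x for x, y in zip(xs, ys)]


def simple_random_sample(rows, n, rng):
    """Draw n rows without replacement, each subset being equally likely."""
    return rng.sample(rows, n)


def systematic_sample(rows, n, rng):
    """Take every k-th row after a random start, with k = len(rows) // n (assumes n <= len(rows))."""
    k = len(rows) // n
    start = rng.randrange(k)
    return [rows[start + i * k] for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    xs = list(range(10))
    ys = [2.0 * x + 1.0 for x in xs]
    ys[3] = None  # simulate missing values
    ys[7] = None
    filled = regression_impute(xs, ys)
    rows = list(zip(xs, filled))
    print(systematic_sample(rows, 5, rng))
    print(simple_random_sample(rows, 5, rng))
```

Systematic sampling spreads the selected rows evenly across the file order, whereas simple random sampling can cluster them by chance. The sketch only shows the mechanics of the two schemes and does not reproduce the comparison reported in the abstract.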
Description: Master's thesis
National Chengchi University
Graduate Institute of Statistics
92354024
92
Source: http://thesis.lib.nccu.edu.tw/record/#G0923540243
Type: thesis
dc.identifier (Other identifier) G0923540243
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/30952
dc.description.tableofcontents Chapter 1 Introduction
     1-1 Research Background
     1-2 Research Motivation
     1-3 Research Objectives
     1-4 Research Framework
     1-5 Thesis Organization

     Chapter 2 Literature Review
     2-1 Overview of Data Mining
     2-1-1 Definition of Data Mining
     2-1-2 Relationship between Data Mining and KDD
     2-1-3 Functions of Data Mining
     2-2 Database Value-Adding
     2-2-1 Meaning of Database Value-Adding
     2-2-2 Imputation Methods in Database Value-Adding
     2-2-3 Sampling Methods in Database Value-Adding
     2-3 Data Mining Algorithms
     2-3-1 Regression Methods
     2-3-2 Neural Networks
     2-3-3 Decision Trees

     Chapter 3 Research Methods
     3-1 Research Overview
     3-1-1 Comparison of Data Structures
     3-1-2 Imputation Procedure
     3-1-3 Sampling Procedure
     3-1-4 Model Evaluation Criteria
     3-2 Research Process
     3-2-1 Description of Each Step
     3-2-2 Description of the Three Main Workflows

     Chapter 4 Empirical Analysis
     4-1 Introduction to the Database
     4-2 Empirical Research Process
     4-2-1 Data Preprocessing
     4-2-2 Analysis
     4-2-3 Results

     Chapter 5 Conclusions and Future Research Directions
     5-1 Conclusions and Recommendations
     5-1-1 Conclusions
     5-1-2 Recommendations
     5-2 Future Research Directions

     References
     Appendix 1
     Appendix 2
     Appendix 3
     Appendix 4
     Appendix 5
dc.language.iso en_US