Title 應用資料採礦技術於資料庫加值中的抽樣方法
THE SAMPLING METHODS FOR VALUE-ADDED DATABASE IN DATA-MINING
Author 陳惠雯
Contributors 鄭宇庭 (advisor)
謝邦昌 (advisor)
陳惠雯 (author)
Keywords Database
Data Mining
Sampling
Value-added database
Date 2003
Uploaded 2009-09-14
Abstract Databases keep growing, and this growth is a defining trend of today's business environment. Reviewing all of the data that reside on the networks of corporations and organizations, such as sales figures, manufacturing statistics, financial data, and experimental data, in order to extract quality information is costly, time-consuming, and ineffective. We therefore need a sound and effective method for obtaining only a portion of the data that is representative of the population and that allows us to build a reliable model from the sampled data. Sometimes, however, the database is limited in size. In that case we take the relatively new approach of adding attributes or values to the database to enhance the quality of the data. In such a procedure, a good sampling method is essential groundwork for the final goal of obtaining a reliable predictive model. The goal of this research is to obtain, by means of a sampling method, an effective and representative value-added sample for building an accurate predictive model. The concept is straightforward: to obtain samples with good predictive value, we need the correct sampling method. The sampling methods under study are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and neural networks for a categorical predicted variable, and stepwise regression for a continuous predicted variable. The results are discussed in the conclusion section.

Keywords: Database, Data Mining, Sampling, Value-added database
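Three of the sampling designs named in the abstract can be sketched as follows. This is a minimal illustration in Python, not the procedure used in the thesis: the function names, the `group` field, and the toy records are assumptions made for the example, and uniform design is not covered.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """Draw n records without replacement; every subset of size n is equally likely."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng):
    """Pick a random start in the first interval, then take every k-th record."""
    k = len(records) // n          # sampling interval; requires n <= len(records)
    start = rng.randrange(k)       # random offset in [0, k)
    return [records[start + i * k] for i in range(n)]


def stratified_sample(records, n, key, rng):
    """Proportional allocation: sample each stratum in proportion to its size.

    Because of rounding, the total can differ slightly from n when the
    stratum shares are not whole numbers.
    """
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    total = len(records)
    sample = []
    for members in strata.values():
        share = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample


# Hypothetical toy population: 700 records in group "A", 300 in group "B".
rng = random.Random(0)
records = [{"id": i, "group": "A" if i < 700 else "B"} for i in range(1000)]

srs = simple_random_sample(records, 100, rng)
sys_sample = systematic_sample(records, 100, rng)
strat = stratified_sample(records, 100, lambda r: r["group"], rng)
```

With proportional allocation the stratified sample preserves the 70/30 split of the toy population, whereas the simple random sample matches it only on average. This is one reason to compare the designs when the sample feeds a predictive model.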
Description Master's
National Chengchi University
Graduate Institute of Statistics
91354016
92
Source http://thesis.lib.nccu.edu.tw/record/#G0091354016
Type thesis
Identifier G0091354016
URI https://nccur.lib.nccu.edu.tw/handle/140.119/30885
Table of Contents
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF MODELS
Chapter 1 INTRODUCTION
1.1. Research Background
1.2. Research Motive
1.3. Research Purpose
1.4. Research Flow
Chapter 2 LITERATURE REVIEW
2.1. Database and Relational Database
2.2. Data Warehouse
2.3. Data Mining
2.4. Introduction to Sampling Method
2.4.1. Simple Random Sample
2.4.2. Systematic Sample
2.4.3. Stratified Sample
2.4.4. Uniform Design
2.5. The Predictive Model
2.5.1. Neural Networks
2.5.1.1. Introduction to Neural Networks
2.5.1.2. Backpropagation Network
2.5.2. Cluster Methods
2.5.2.1. C5.0
2.5.2.2. CART
2.5.3. Regression Model
2.5.3.1. Stepwise Regression
2.5.3.2. Logistic Regression
Chapter 3 RESEARCH METHODOLOGY
3.1. Research Concept
3.2. Research Frame
Chapter 4 EXPERIMENTAL RESULTS
4.1. Introduction to Database
4.2. The Research Content
4.2.1. The Distribution of Data
4.2.2. Sampling
4.3. Comparing the Sampling Methods
4.3.1. C5.0
4.3.2. Neural Networks
4.3.3. Logistic Regression
4.3.4. Stepwise Regression
4.3.5. Comparing the Models' Accuracy
4.4. The Discussion of Stratified Sampling Method
Chapter 5 CONCLUSION AND RESEARCH DIRECTION
5.1. Conclusion
5.2. Suggestion
5.3. Future Work
REFERENCES
APPENDIX
     
List of Tables
Table 2.1 the dummy variable table
Table 3.1 the classification table
Table 4.1 all variables
Table 4.2 the research variables
Table 4.3 the continuous variables
Table 4.4 the sample size of the different sampling methods
Table 4.5 the correct rates on C5.0
Table 4.6 the mean and variance of correct rates on C5.0
Table 4.7 the alpha values on C5.0
Table 4.8 the mean and variance of the alpha values on C5.0
Table 4.9 the beta values on C5.0
Table 4.10 the mean and variance of the beta values on C5.0
Table 4.11 the result of neural networks
Table 4.12 the correct rates on neural networks
Table 4.13 the mean and variance of the correct rates on neural networks
Table 4.14 the alpha values on neural networks
Table 4.15 the mean and variance of the alpha values on neural networks
Table 4.16 the beta values on neural networks
Table 4.17 the mean and variance of the beta values on neural networks
Table 4.18 the correct rates on logistic regression
Table 4.19 the mean and variance of the correct rates on logistic regression
Table 4.20 the alpha values on logistic regression
Table 4.21 the mean and variance of the alpha values on logistic regression
Table 4.22 the beta values on logistic regression
Table 4.23 the mean and variance of the beta values on logistic regression
Table 4.24 the output of the regression
Table 4.25 the MSE values
Table 4.26 the mean and variance of MSE values
Table 4.27 the compared correct rates on mean and variance
Table 4.28 the compared mean on alpha and beta values
Table 4.29 the correct rates on four stratified variables in C5.0
Table 4.30 the correct rates on four stratified variables in neural networks
Table 4.31 the correct rates on four stratified variables in logistic regression
Table 4.32 the mean of correct rates
     
     
List of Figures
Figure 2.1 the relational algebra
Figure 2.2 the organization of Data Warehouse
Figure 2.3 KDD process
Figure 2.4 data mining models and tasks
Figure 2.5 main methodology for data mining
Figure 2.6 the flow of CRISP-DM
Figure 2.7 the original scatter plot
Figure 2.8 the scatter plot after orthogonalization
Figure 2.9 the scatter plot for correlated variable
Figure 2.10 the scatter plot for correlated variable without orthogonalization in PSA
Figure 2.11 the scatter plot for correlated variable after orthogonalization in PSA
Figure 2.12 the model of artificial neural network
Figure 2.13 the backpropagation network
Figure 2.14 stepwise regression method
Figure 2.15 the graph of logistic model
Figure 3.1 the graph of research concept
Figure 3.2 the research frame
Figure 4.1 the distribution of ground
Figure 4.2 the distribution of floor area of buildings
Figure 4.3 the distribution of workers
Figure 4.4 the distribution of salary
Figure 4.5 the distribution of operating expenditures
Figure 4.6 the distribution of operating revenues
Figure 4.7 the distribution of total assets
Figure 4.8 the distribution of fixed assets rented and borrowed
Figure 4.9 the distribution of fixed assets rented and lent
Figure 4.10 the distribution of expenditures on research development and technology acquiring
Figure 4.11 the distribution of expenditures on environment protection
Figure 4.12 the distribution of total value of production
Figure 4.13 the distribution of net value added
Figure 4.14 the distribution of net value of interest expenditures
Figure 4.15 the distribution of current assets
Figure 4.16 the distribution of profit
Figure 4.17 the distribution of triangular trade
Figure 4.18 the distribution of computer
Figure 4.19 the distribution of E-commerce
Figure 4.20 the distribution of profit
Figure 4.21 the result of C5.0
Figure 4.22 the correct rates on C5.0
Figure 4.23 the alpha values on C5.0
Figure 4.24 the beta values on C5.0
Figure 4.25 the correct rates on neural networks
Figure 4.26 the alpha values on neural networks
Figure 4.27 the beta values on neural networks
Figure 4.28 the correct rates on logistic regression
Figure 4.29 the alpha values on logistic regression
Figure 4.30 the beta values on logistic regression
Figure 4.31 the compared correct rates on mean
Figure 4.32 the compared correct rates on variance
Figure 4.33 the compared mean on alpha and beta values
Figure 4.34 the mean of correct rates
     
     
List of Models
Function (1) Koksma-Hlawka inequality
Function (2) the sigmoid function
Function (3) the logistic function
Function (4) the multiple logistic function
Function (5) the multiple logistic function
Function (6) the stepwise regression model
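For orientation, the first functions named in this list have the following standard textbook forms. The record does not reproduce the thesis's own statements, so the notation here ($f$, $x_i$, $V(f)$, $D_n^*$, $\beta_j$) is an assumption for illustration and may differ from the thesis.

```latex
% Koksma-Hlawka inequality: the error of approximating an integral over the
% unit cube by the mean over n points is bounded by the variation of f
% (in the sense of Hardy and Krause) times the star discrepancy of the points.
\left| \int_{[0,1]^s} f(u)\,du - \frac{1}{n}\sum_{i=1}^{n} f(x_i) \right|
  \le V(f)\, D_n^*(x_1,\dots,x_n)

% Sigmoid activation function
f(x) = \frac{1}{1 + e^{-x}}

% Logistic function with one predictor
P(Y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

% Multiple logistic function with p predictors
P(Y = 1 \mid x_1,\dots,x_p)
  = \frac{e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_j}}
         {1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_j}}
```

The Koksma-Hlawka bound is the usual motivation for uniform design: choosing points with low discrepancy tightens the bound on the approximation error.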
Language en_US