dc.contributor.advisor | 鄭宇庭<br>謝邦昌 | zh_TW |
dc.contributor.author (Authors) | 陳惠雯 | zh_TW |
dc.creator (作者) | 陳惠雯 | zh_TW |
dc.date (日期) | 2003 | en_US |
dc.date.accessioned | 2009-09-14 | - |
dc.date.available | 2009-09-14 | - |
dc.date.issued (上傳時間) | 2009-09-14 | - |
dc.identifier (Other Identifiers) | G0091354016 | en_US |
dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/30885 | - |
dc.description (描述) | 碩士 | zh_TW |
dc.description (描述) | 國立政治大學 | zh_TW |
dc.description (描述) | 統計研究所 | zh_TW |
dc.description (描述) | 91354016 | zh_TW |
dc.description (描述) | 92 | zh_TW |
dc.description.abstract (摘要) | Growing databases have become the norm in today’s business environment and will remain so for the foreseeable future. Extracting quality information by reviewing the mountains of data that reside on corporate or organizational networks, such as sales figures, manufacturing statistics, financial data, and experimental data, is costly, time-consuming, and ineffective. We therefore need a sound and effective method for obtaining only a portion of the data that is representative of the population and that allows a reliable model to be built from the sample. Sometimes, however, the database is limited in size; in that case we adopt the relatively new idea of adding attributes or values to the database to enhance the quality of the data. Under such a procedure, a good sampling method is important groundwork for the final goal of obtaining a reliable predictive model. Our research goal is therefore to obtain an effective and representative value-added sample, by means of a sampling method, for building an accurate predictive model. The concept is straightforward: good predictive samples require the correct sampling method. The sampling methods under study are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and neural networks for a categorical predicted variable, and stepwise regression for a continuous predicted variable. The results are discussed in the conclusion section. Keywords: Database, Data Mining, Sampling, Value-added database | en_US |
dc.description.tableofcontents | ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF MODELS
Chapter 1 INTRODUCTION
1.1. Research Background
1.2. Research Motive
1.3. Research Purpose
1.4. Research Flow
Chapter 2 LITERATURE REVIEW
2.1. Database and Relational Database
2.2. Data Warehouse
2.3. Data Mining
2.4. Introduction to Sampling Method
2.4.1. Simple Random Sample
2.4.2. Systematic Sample
2.4.3. Stratified Sample
2.4.4. Uniform Design
2.5. The Predictive Model
2.5.1. Neural Networks
2.5.1.1. Introduction to Neural Network
2.5.1.2. Backpropagation Network
2.5.2. Cluster Methods
2.5.2.1. C5.0
2.5.2.2. CART
2.5.3. Regression Model
2.5.3.1. Stepwise Regression
2.5.3.2. Logistic Regression
Chapter 3 RESEARCH METHODOLOGY
3.1. Research Concept
3.2. Research Frame
Chapter 4 EXPERIMENTAL RESULTS
4.1. Introduction to Database
4.2. The Research Content
4.2.1. The Distribution of Data
4.2.2. Sampling
4.3. Compare the Sampling Methods
4.3.1. C5.0
4.3.2. Neural Networks
4.3.3. Logistic Regression
4.3.4. Stepwise Regression
4.3.5. Compare the Models Accuracy
4.4. The Discussion of Stratified Sampling Method
Chapter 5 CONCLUSION AND RESEARCH DIRECTION
5.1. Conclusion
5.2. Suggestion
5.3. Future Work
REFERENCES
APPENDIX
List of Tables
Table 2.1 the dummy variable table
Table 3.1 the classify table
Table 4.1 all variables
Table 4.2 the research variables
Table 4.3 the continuous variables
Table 4.4 the sample size of the different sample methods
Table 4.5 the correct rates on C5.0
Table 4.6 the mean and variance of correct rates on C5.0
Table 4.7 the alpha values on C5.0
Table 4.8 the mean and variance of the alpha values on C5.0
Table 4.9 the beta values on C5.0
Table 4.10 the mean and variance of the beta values on C5.0
Table 4.11 the result of neural networks
Table 4.12 the correct rates on neural networks
Table 4.13 the mean and variance of the correct rates on neural networks
Table 4.14 the alpha values on Neural Networks
Table 4.15 the mean and variance of the alpha values on neural networks
Table 4.16 the beta values on neural networks
Table 4.17 the mean and variance of the beta values on neural networks
Table 4.18 the correct rates on logistic regression
Table 4.19 the mean and variance of the correct rates on logistic regression
Table 4.20 the alpha values on logistic regression
Table 4.21 the mean and variance of the alpha values on logistic regression
Table 4.22 the beta values on logistic regression
Table 4.23 the mean and variance of the beta values on logistic regression
Table 4.24 the output of the regression
Table 4.25 the MSE values
Table 4.26 the mean and variance of MSE values
Table 4.27 the compared correct rates on mean and variance
Table 4.28 the compared mean on alpha and beta values
Table 4.29 the correct rates on four stratified variables in C5.0
Table 4.30 the correct rates on four stratified variables in neural networks
Table 4.31 the correct rates on four stratified variables in logistic regression
Table 4.32 the mean of correct rates
List of Figures
Figure 2.1 the relational algebra
Figure 2.2 the organization of Data Warehouse
Figure 2.3 KDD process
Figure 2.4 data mining models and tasks
Figure 2.5 main methodology for data mining
Figure 2.6 the flow of CRISP-DM
Figure 2.7 the original scatter plot
Figure 2.8 the scatter plot after orthogonal
Figure 2.9 the scatter plot for correlated variable
Figure 2.10 the scatter plot for correlated variable without orthogonal in PSA
Figure 2.11 the scatter plot for correlated variable after orthogonal in PSA
Figure 2.12 the model of artificial neural network
Figure 2.13 the backpropagation network
Figure 2.14 stepwise regression method
Figure 2.15 the graph of logistic model
Figure 3.1 the graph of research concept
Figure 3.2 the research frame
Figure 4.1 the distribution of ground
Figure 4.2 the distribution of floor area of buildings
Figure 4.3 the distribution of workers
Figure 4.4 the distribution of salary
Figure 4.5 the distribution of operating expenditures
Figure 4.6 the distribution of operating revenues
Figure 4.7 the distribution of total assets
Figure 4.8 the distribution of fixed assets rented and borrowed
Figure 4.9 the distribution of fixed assets rented and lent
Figure 4.10 the distribution of expenditures on research development and technology acquiring
Figure 4.11 the distribution of expenditures on environment protection
Figure 4.12 the distribution of total value of production
Figure 4.13 the distribution of net value added
Figure 4.14 the distribution of net value of interest expenditures
Figure 4.15 the distribution of current assets
Figure 4.16 the distribution of profit
Figure 4.17 the distribution of triangular trade
Figure 4.18 the distribution of computer
Figure 4.19 the distribution of E-commerce
Figure 4.20 the distribution of profit
Figure 4.21 the result of C5.0
Figure 4.22 the correct rates on C5.0
Figure 4.23 the alpha values on C5.0
Figure 4.24 the beta values on C5.0
Figure 4.25 the correct rates on neural networks
Figure 4.26 the alpha values on Neural Networks
Figure 4.27 the beta values on neural networks
Figure 4.28 the correct rates on logistic regression
Figure 4.29 the alpha values on logistic regression
Figure 4.30 the beta values on logistic regression
Figure 4.31 the compared correct rates on mean
Figure 4.32 the compared correct rates on variance
Figure 4.33 the compared mean on alpha and beta values
Figure 4.34 the mean of correct rates
List of Models
Function (1) Koksma-Hlawka inequality
Function (2) the Sigmoid function
Function (3) the logistic function
Function (4) the multiple logistic function
Function (5) the multiple logistic function
Function (6) the stepwise regression model | zh_TW |
dc.language.iso | en_US | - |
dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0091354016 | en_US |
dc.subject (關鍵詞) | 資料庫 | zh_TW |
dc.subject (關鍵詞) | 資料採礦 | zh_TW |
dc.subject (關鍵詞) | 抽樣方法 | zh_TW |
dc.subject (關鍵詞) | 資料加值 | zh_TW |
dc.subject (關鍵詞) | Database | en_US |
dc.subject (關鍵詞) | Data Mining | en_US |
dc.subject (關鍵詞) | Sampling | en_US |
dc.subject (關鍵詞) | Value-added database | en_US |
dc.title (題名) | 應用資料採礦技術於資料庫加值中的抽樣方法 | zh_TW |
dc.title (題名) | THE SAMPLING METHODS FOR VALUE-ADDED DATABASE IN DATA-MINING | en_US |
dc.type (資料類型) | thesis | en |
dc.relation.reference (參考文獻) | Chinese | zh_TW |
dc.relation.reference (參考文獻) | [1] 趙民德、謝邦昌,探索真相-抽樣理論和實務,曉園出版社,1999. | zh_TW |
dc.relation.reference (參考文獻) | [2] 黃文隆,抽樣方法,滄海書局,1999. | zh_TW |
dc.relation.reference (參考文獻) | [3] 趙民德,砂中選礦(Data Mining)的一些我見我思,中國統計學報,2002,12. | zh_TW |
dc.relation.reference (參考文獻) | [4] 王濟川、郭志剛,Logistic 迴歸模型-方法及應用,五南圖書出版股份有限公司,2003,3. | zh_TW |
dc.relation.reference (參考文獻) | [5] 崔巍 編著,陳舜德 審校,資料庫系統與應用,博碩文化股份有限公司,2001,4. | zh_TW |
dc.relation.reference (參考文獻) | [6] 張慶賀,資料倉儲中實體化視域自我維護之研究,朝陽科技大學,2003. | zh_TW |
dc.relation.reference (參考文獻) | English | zh_TW |
dc.relation.reference (參考文獻) | [1] Alan Mayne, Michael B. Wood, Introducing Relational Database, 1983. | zh_TW |
dc.relation.reference (參考文獻) | [2] Bernd Gärtner and Emo Welzl, A Simple Sampling Lemma: Analysis and Applications in Geometric Optimization, 2002, 4. | zh_TW |
dc.relation.reference (參考文獻) | [3] Colleen McCue, Emily S. Stone, Teresa P. Gooch, Data Mining and Value-Added Analysis, 2003. | zh_TW |
dc.relation.reference (參考文獻) | [4] Chap T. Le, Applied Categorical Data Analysis, Wiley-Interscience Publication, 1998. | zh_TW |
dc.relation.reference (參考文獻) | [5] C. J. Date, Relational Database Writings 1991-1994, 1995. | zh_TW |
dc.relation.reference (參考文獻) | [6] David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, 2001. | zh_TW |
dc.relation.reference (參考文獻) | [7] Laboratory 2: Ecological population: a crash course in sampling and statistics. | zh_TW |
dc.relation.reference (參考文獻) | [8] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, 2003. | zh_TW |
dc.relation.reference (參考文獻) | [9] Carl-Erik Särndal, Bengt Swensson, Jan Wretman, Model Assisted Survey Sampling, New York: Springer-Verlag, 1992. | zh_TW |
dc.relation.reference (參考文獻) | [10] USDA Technical Services Division: Grain Inspection, Packers and Stockyards Administration, 2001, 1. | zh_TW |
dc.relation.reference (參考文獻) | [11] William Mendenhall, Terry Sincich, A Second Course in Statistics: Regression Analysis, Prentice Hall, fifth edition, 1996. | zh_TW |