應用資料採礦技術於資料庫加值中的抽樣方法 (The Sampling Methods for Value-Added Database in Data-Mining)

Title	應用資料採礦技術於資料庫加值中的抽樣方法 (The Sampling Methods for Value-Added Database in Data-Mining)
Author	陳惠雯
Contributors	鄭宇庭 (advisor); 謝邦昌 (advisor); 陳惠雯 (author)
Keywords	資料庫; 資料採礦; 抽樣方法; 資料加值; Database; Data Mining; Sampling; Value-added database
Date	2003
Upload date	2009-09-14
Abstract	As databases continue to grow in today's business environment, extracting quality information by reviewing all of the data held on corporate and organizational networks (sales figures, manufacturing statistics, financial data, and experimental data) is costly, time-consuming, and ineffective. We therefore need a sound and effective method for obtaining only a portion of the data that is representative of the population and that allows a reliable model to be built from the sampled data. Sometimes, however, the database is limited in size; in that case we apply a relatively new idea, adding attributes or values to the database to enhance the quality of the data. Under such a procedure, a good sampling method is the essential groundwork for reaching the final goal, a reliable predictive model. The goal of this research is to obtain an effective and representative value-added sample, by means of a sampling method, for building an accurate predictive model. The concept is straightforward: good predictive samples require appropriate sampling methods. The sampling methods under study are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and neural networks for a categorical predicted variable, and stepwise regression for a continuous predicted variable. The results are discussed in the conclusion section. Keywords: Database, Data Mining, Sampling, Value-added database
Description	Master's thesis; 國立政治大學 (National Chengchi University); 統計研究所 (Graduate Institute of Statistics); 91354016; 92
Source	http://thesis.lib.nccu.edu.tw/record/#G0091354016
Type	thesis
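
The abstract names simple random, systematic, and stratified sampling among the methods compared. As a rough illustration only, the sketch below shows one common textbook form of each in Python. It is not the thesis's implementation: the record layout, the `sector` field, the proportional allocation rule, and the sample sizes are all assumptions made for the example, and uniform design is omitted.

```python
import random
from collections import defaultdict

def simple_random_sample(records, n, rng):
    """Draw n records without replacement; every subset of size n is equally likely."""
    return rng.sample(records, n)

def systematic_sample(records, n, rng):
    """Pick a random start in the first interval, then take every k-th record (k = N // n)."""
    k = len(records) // n
    start = rng.randrange(k)
    return [records[start + i * k] for i in range(n)]

def stratified_sample(records, n, key, rng):
    """Proportional allocation: sample each stratum in proportion to its size.

    Rounding can make the total differ slightly from n when stratum
    shares are not whole numbers.
    """
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    total = len(records)
    sample = []
    for group in strata.values():
        m = round(n * len(group) / total)
        sample.extend(rng.sample(group, m))
    return sample

# Hypothetical data: 1000 records, 250 in sector "A" and 750 in sector "B".
rng = random.Random(0)
records = [{"id": i, "sector": "A" if i % 4 == 0 else "B"} for i in range(1000)]

srs = simple_random_sample(records, 100, rng)
sys_s = systematic_sample(records, 100, rng)
strat = stratified_sample(records, 100, lambda r: r["sector"], rng)
```

With proportional allocation the stratified sample keeps the population's 1:3 sector ratio exactly, whereas the simple random sample only matches it on average.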

Identifier	G0091354016
URI	https://nccur.lib.nccu.edu.tw/handle/140.119/30885
Table of contents
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF MODEL
Chapter 1 INTRODUCTION
  1.1. Research Background
  1.2. Research Motive
  1.3. Research Purpose
  1.4. Research Flow
Chapter 2 LITERATURE REVIEW
  2.1. Database and Relational Database
  2.2. Data Warehouse
  2.3. Data Mining
  2.4. Introduction to Sampling Method
    2.4.1. Simple Random Sample
    2.4.2. Systematic Sample
    2.4.3. Stratified Sample
    2.4.4. Uniform Design
  2.5. The Predictive Model
    2.5.1. Neural Networks
      2.5.1.1. Introduce to Neural Network
      2.5.1.2. Backpropagation Network
    2.5.2. Cluster Methods
      2.5.2.1. C5.0
      2.5.2.2. CART
    2.5.3. Regression Model
      2.5.3.1. Stepwise Regression
      2.5.3.2. Logistic Regression
Chapter 3 RESEARCH METHODOLOGY
  3.1. Research Concept
  3.2. Research Frame
Chapter 4 EXPERIMENTAL RESULTS
  4.1. Introduction to Database
  4.2. The Research Content
    4.2.1. The Distribution of Data
    4.2.2. Sampling
  4.3. Compare the Sampling Methods
    4.3.1. C5.0
    4.3.2. Neural Networks
    4.3.3. Logistic Regression
    4.3.4. Stepwise Regression
    4.3.5. Compare the Models Accuracy
  4.4. The Discussion of Stratified Sampling Method
Chapter 5 CONCLUSION AND RESEARCH DIRECTION
  5.1. Conclusion
  5.2. Suggestion
  5.3. Future Work
REFERENCES
APPENDIX

List of Tables
  Table 2.1 the dummy variable table
  Table 3.1 the classify table
  Table 4.1 all variables
  Table 4.2 the research variables
  Table 4.3 the continuous variables
  Table 4.4 the sample size of the different sample methods
  Table 4.5 the correct rates on C5.0
  Table 4.6 the mean and variance of correct rates on C5.0
  Table 4.7 the alpha values on C5.0
  Table 4.8 the mean and variance of the alpha values on C5.0
  Table 4.9 the beta values on C5.0
  Table 4.10 the mean and variance of the beta values on C5.0
  Table 4.11 the result of neural networks
  Table 4.12 the correct rates on neural networks
  Table 4.13 the mean and variance of the correct rates on neural networks
  Table 4.14 the alpha values on Neural Networks
  Table 4.15 the mean and variance of the alpha values on neural networks
  Table 4.16 the beta values on neural networks
  Table 4.17 the mean and variance of the beta values on neural networks
  Table 4.18 the correct rates on logistic regression
  Table 4.19 the mean and variance of the correct rates on logistic regression
  Table 4.20 the alpha values on logistic regression
  Table 4.21 the mean and variance of the alpha values on logistic regression
  Table 4.22 the beta values on logistic regression
  Table 4.23 the mean and variance of the beta values on logistic regression
  Table 4.24 the output of the regression
  Table 4.25 the MSE values
  Table 4.26 the mean and variance of MSE values
  Table 4.27 the compared correct rates on mean and variance
  Table 4.28 the compared mean on alpha and beta values
  Table 4.29 the correct rates on four stratified variables in C5.0
  Table 4.30 the correct rates on four stratified variables in neural networks
  Table 4.31 the correct rates on four stratified variables in logistic regression
  Table 4.32 the mean of correct rates

List of Figures
  Figure 2.1 the relational algebra
  Figure 2.2 the organization of Data Warehouse
  Figure 2.3 KDD process
  Figure 2.4 data mining models and tasks
  Figure 2.5 main methodology for data mining
  Figure 2.6 the flow of CRISP-DM
  Figure 2.7 the original scatter plot
  Figure 2.8 the scatter plot after orthogonal
  Figure 2.9 the scatter plot for correlated variable
  Figure 2.10 the scatter plot for correlated variable without orthogonal in PSA
  Figure 2.11 the scatter plot for correlated variable after orthogonal in PSA
  Figure 2.12 the model of artificial neural network
  Figure 2.13 the backpropagation network
  Figure 2.14 stepwise regression method
  Figure 2.15 the graph of logistic model
  Figure 3.1 the graph of research concept
  Figure 3.2 the research frame
  Figure 4.1 the distribution of ground
  Figure 4.2 the distribution of floor area of buildings
  Figure 4.3 the distribution of workers
  Figure 4.4 the distribution of salary
  Figure 4.5 the distribution of operating expenditures
  Figure 4.6 the distribution of operating revenues
  Figure 4.7 the distribution of total assets
  Figure 4.8 the distribution of fixed assets rented and borrowed
  Figure 4.9 the distribution of fixed assets rented and lent
  Figure 4.10 the distribution of expenditures on research development and technology acquiring
  Figure 4.11 the distribution of expenditures on environment protection
  Figure 4.12 the distribution of total value of production
  Figure 4.13 the distribution of net value added
  Figure 4.14 the distribution of net value of interest expenditures
  Figure 4.15 the distribution of current assets
  Figure 4.16 the distribution of profit
  Figure 4.17 the distribution of triangular trade
  Figure 4.18 the distribution of computer
  Figure 4.19 the distribution of E-commerce
  Figure 4.20 the distribution of profit
  Figure 4.21 the result of C5.0
  Figure 4.22 the correct rates on C5.0
  Figure 4.23 the alpha values on C5.0
  Figure 4.24 the beta values on C5.0
  Figure 4.25 the correct rates on neural networks
  Figure 4.26 the alpha values on Neural Networks
  Figure 4.27 the beta values on neural networks
  Figure 4.28 the correct rates on logistic regression
  Figure 4.29 the alpha values on logistic regression
  Figure 4.30 the beta values on logistic regression
  Figure 4.31 the compared correct rates on mean
  Figure 4.32 the compared correct rates on variance
  Figure 4.33 the compared mean on alpha and beta values
  Figure 4.34 the mean of correct rates

List of model
  Function (1) Koksma-Hlawka inequality
  Function (2) the Sigmoid function
  Function (3) the logistic function
  Function (4) the multiple logistic function
  Function (5) the multiple logistic function
  Function (6) the stepwise regression model
Language	en_US
