Title 應用資料採礦技術於資料庫加值中的誤差指標及模型準則
ERROR INDEX AND MODEL CRITERIA FOR VALUE-ADDED DATABASE IN DATA MINING
Author 包寶茹
Contributors 鄭宇庭; 謝邦昌; 包寶茹
Keywords Data mining
Database value-added
Database
Error index
Model criteria
Similarity
Date 2003
Upload date 2009-09-14
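Abstract Using data to help businesses make correct and appropriate decisions is a long-established idea. In traditional statistics, the database at hand is usually analyzed directly. In data mining, however, analysts often face a shortage of data, which in turn limits the value of the database. If a survey sample can be used to estimate how the fields missing from the target database relate to the other fields within the sample, that relationship can be carried back to the target database to fill in the missing fields. The database is thereby enlarged, which is what is meant by adding value to the data (value-added). Future analyses that need these fields then only require a sample, which can also effectively reduce business costs and waste.
     The purpose of this study is to integrate the statistical theories and methods proposed by earlier scholars and to identify error indices and model criteria that show whether the added fields are credible. Adding fields to the target database introduces error, and the size of that error affects how the feasibility and credibility of the added fields are judged. This study therefore does not consider the choice of sampling method. It assumes simple random sampling, examines the difference between predicted and actual values before and after value is added, and then compares them. The fields to be added are divided into continuous and categorical types, and suitable indices and criteria are sought for each. For categorical fields, the judgment index is built on the concept of similarity; for continuous fields, the discussion is framed in terms of distance and correlation. In this way, reasonable error indices and model criteria can be established, according to the type of the target field, to judge whether the added field is credible and to assess how useful it is.
     The empirical results show that adding value to a database is a feasible approach. The similarity between the predicted values, obtained by feeding the estimated data into the model, and the original observed values is above 90% in every case, which indicates that the added fields are credible.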
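The value-adding procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the column names (`income`, `spending`), the data, and the choice of a one-predictor least-squares regression are all assumptions made for illustration. The relation is fitted on a survey sample in which both columns are observed and is then carried over to a target database that lacks one of them. Cosine similarity, one possible similarity measure, compares fitted and observed values.

```python
import math


def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x on the survey sample."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def add_column(target_rows, intercept, slope, known="income", new="spending"):
    """Return copies of the target rows with the missing column filled in."""
    return [{**row, new: intercept + slope * row[known]} for row in target_rows]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical survey sample: both columns are observed.
survey_income = [10.0, 20.0, 30.0, 40.0]
survey_spending = [7.0, 12.0, 17.0, 22.0]
intercept, slope = fit_simple_regression(survey_income, survey_spending)

# Hypothetical target database: the "spending" column is missing.
target = [{"id": 1, "income": 15.0}, {"id": 2, "income": 35.0}]
enriched = add_column(target, intercept, slope)

# Compare fitted values with the observed values in the survey sample.
fitted = [intercept + slope * x for x in survey_income]
similarity = cosine_similarity(fitted, survey_spending)
```

In practice the comparison would be made on held-out records rather than on the sample used for fitting; the sketch reuses the survey sample only to stay short.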
In recent years, data mining has gained recognition and acceptance in a variety of industries, such as finance, insurance, and electronics, for its success in extracting valuable information from databases and turning it into opportunities.
     Database value-adding is a new idea that is not yet fully mature, and its effect differs from one database to another. The goal of this research is therefore to find valid and accountable model criteria as a means of determining whether the added columns improve the database and, in turn, the overall prediction results. After selecting a model appropriate to the data type, we applied the error index and model criteria to evaluate the model's performance, that is, whether the model accurately predicted the value-added column. The criteria used in this research are RMSE for continuous data and F-value for discrete data. Our findings indicate that the error index and model criteria used here provide an accountable measure of the reliability of adding columns to the database.
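The two criteria named in the abstract can be sketched as follows. This assumes that "F-value" refers to the balanced F-measure, the harmonic mean of precision and recall, which the record does not spell out. The function names and sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted continuous values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall_f(actual, predicted, positive=1):
    """Precision, recall, and balanced F-measure for one class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Continuous column: hypothetical observed vs. predicted values.
continuous_error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

# Categorical column: hypothetical observed vs. predicted labels.
precision, recall, f_value = precision_recall_f([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```

A lower RMSE and a higher F-value both indicate that the added column tracks the observed values more closely, so the two numbers are read in opposite directions.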
Description Master's
National Chengchi University
Graduate Institute of Statistics
91354017
92
Source http://thesis.lib.nccu.edu.tw/record/#G0091354017
Type thesis
Advisors 鄭宇庭; 謝邦昌
Table of Contents
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
1.1. Background
1.2. Motive Of This Thesis
1.3. Purpose Of This Thesis
1.4. Thesis Layout
CHAPTER 2 LITERATURE REVIEW
2.1. Relational Database and Data warehouse for data mining
 2.1.1 Relational database
 2.1.2 Data warehouse
2.2. Data mining
2.3. Regression Methods
 2.3.1 Regression Methods
 2.3.2 Logistic Regression
2.4. Data mining Method
 2.4.1 Decision Tree
 2.4.2 Artificial Neural Network
2.5. Forecasting Error Index And Model Criteria
 2.5.1 Similarity Measures
 2.5.2 Error Measure
 2.5.2 Distance Measures or Dissimilarity Measures
 2.5.2 Error Measure
CHAPTER 3 RESEARCH METHODOLOGY
3.1. Data And Sampling Selection
3.2. Research Method
3.3. Research Frame

CHAPTER 4 EVALUATING PERFORMANCE
4.1. Descriptive Statistics Analysis
4.2. Building Prediction Model
 4.2.1 Stepwise Regression Model
 4.2.2 Neural Network Model
 4.2.3 Comparison
4.3. Building Classification Model
 4.3.1 C5.0 Model
 4.3.2 Neural Network Model
 4.3.3 Logistic Model
 4.3.4 Comparison
CHAPTER 5 CONCLUSION AND RESEARCH DIRECTION
5.1. Conclusion And Suggestion
 5.1.1 Conclusion
 5.1.2 Suggestion
5.2. Future Work

REFERENCE
APPENDIX

List of Tables
Table 2.1 Development of database technology
Table 2.2 Student’s scores table
Table 2.3 Student’s fundamental information table
Table 2.4 The example of the “primary key”
Table 2.5 Show the combination table
Table 2.6 Six kinds of domain in Data Mining
Table 2.7 The example of dummy variable
Table 2.8 Forecast error measures
Table 2.9 Distance measures (or dissimilarity measures)
Table 3.1 Introduce Variable and explanation
Table 4.1 Descriptive Statistics
Table 4.2 Pearson correlation
Table 4.3 Stepwise Regression Model Summary
Table 4.4 The average of training 30 times
Table 4.5 Relative importance of inputs
Table 4.6 The average of training 30 times
Table 4.7 The average of training 30 times
Table 4.8 The average of training 30 times
Table 4.9 Relative importance of inputs
Table 4.10 The average of training 30 times

List of Figures
Figure 1.1 Flow chart
Figure 2.1 KDD process
Figure 2.2 KDD process (Clementine, CRISP-DM)
Figure 2.3 A simple decision tree
Figure 2.4 A basic structure of FNN with two hidden layers
Figure 2.5 BPNN
Figure 2.6 Show the confusion matrix
Figure 3.1 The confusion matrix
Figure 3.2 Research Frame
Figure 4.1 Cosine index of the two models
Figure 4.2 Cosine index of the two models
Figure 4.3 Distance measure
Figure 4.4 RMSE of the two models
Figure 4.5 “Whether has the profit”
Figure 4.6 C5.0 decision tree
Figure 4.7 Training and testing of C5.0 model
Figure 4.8 Training and testing of NN model
Figure 4.9 Training and testing of Logistic Regression model
Figure 4.10 Jaccard coefficient
Figure 4.11 Compare precision of three models
Figure 4.12 Compare Recall of three models
Figure 4.13 Compare F-value of three models
Figure 5.1 Sampling survey
Figure 5.2 Database mapping
Language en_US