Title 預測模型的遺失值處理─選值順序的研究
Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition
Author 黃秋芸
Huang, Chiu Yun
Contributors 唐揆
Tang, Kwei
黃秋芸
Huang, Chiu Yun
Keywords Predictive Model
Missing Value
Active Feature-value Acquisition
Decision Tree
Date 2013
Uploaded 7-Jul-2014 11:10:36 (UTC+8)
Abstract Business intelligence is developing rapidly, and predictive models play a key role in many business intelligence tasks. However, when extracting hidden, previously unknown, and potentially useful information from large volumes of data, analysts often run into data quality problems that make the analysis difficult. Missing values are an especially common obstacle in the data pre-processing phase. How to handle missing values effectively when building a predictive model is therefore an important issue.
Many studies have addressed the handling of missing values. Among them, research on Active Feature-Value Acquisition (AFA) focuses on the order in which missing feature values in the training data are acquired. The goal of AFA is to reduce the cost of reaching a desired model accuracy by identifying the instances for which obtaining complete information is most informative. Following the AFA concept, we propose a new approach, I Sampling, in which feature values are selected for acquisition based on the attribute at the top node of the current decision tree. We train and test the approach on real data and compare it with methods from the earlier literature under different situations and missing-data patterns, to examine whether the choice and order of acquisitions affect classification accuracy and to understand each method's strengths, weaknesses, and applicability.
Experimental results show that, in some situations, our approach induces accurate models using substantially fewer feature-value acquisitions than alternative policies. The proposed method and its validation results can inform future predictive work in both academia and business: practitioners can choose an acquisition method suited to the nature of their data, reducing acquisition cost while improving the classification ability of the predictive model.
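The abstract describes I Sampling only at a high level, so the following is a minimal sketch of the general idea rather than the thesis's implementation. It assumes an ID3-style tree whose top node is the attribute with the highest information gain, computed over the rows where that attribute is known, and it treats `None` as the missing-value marker. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing feature value


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Information gain of `attr`, using only rows where it is known."""
    known = [(r[attr], y) for r, y in zip(rows, labels)
             if r[attr] is not MISSING]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for value, y in known:
        groups[value].append(y)
    base = entropy([y for _, y in known])
    remainder = sum(len(g) / len(known) * entropy(g)
                    for g in groups.values())
    return base - remainder


def root_attribute(rows, labels):
    """Attribute an ID3-style tree would place at its top node (assumption)."""
    n_attrs = len(rows[0])
    return max(range(n_attrs),
               key=lambda a: information_gain(rows, labels, a))


def next_acquisitions(rows, labels):
    """Missing cells, as (row index, attribute index), of the root attribute."""
    attr = root_attribute(rows, labels)
    return [(i, attr) for i, r in enumerate(rows) if r[attr] is MISSING]


# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
rows = [
    ["a", "x"],
    ["a", "y"],
    ["b", "x"],
    ["b", "y"],
    [MISSING, "x"],
    ["a", MISSING],
]
labels = ["yes", "yes", "no", "no", "no", "yes"]
print(next_acquisitions(rows, labels))  # prints [(4, 0)]
```

In a full acquisition loop one would presumably fill the returned cells, re-induce the tree, and repeat until the acquisition budget is spent. That loop is left out here because the abstract does not specify it.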
Description Master's thesis
National Chengchi University
Graduate Institute of Business Administration
101355006
102
Source http://thesis.lib.nccu.edu.tw/record/#G0101355006
Type thesis
dc.contributor.advisor 唐揆 (Tang, Kwei)
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/67317
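dc.description.tableofcontents Acknowledgments
Abstract (Chinese)
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Research Framework
Section 4 Research Results and Contributions
Section 5 Thesis Structure
Chapter 2 Literature Review
Section 1 Missing Values
2.1.1 Types of Missing Values
2.1.2 Methods for Handling Missing Values
Section 2 Machine Learning
2.2.1 Development and Principles
2.2.2 AFA (Active Feature-Value Acquisition)
Chapter 3 Research Method
Section 1 Research Framework
Section 2 Description of I Sampling
3.2.1 Research Idea
3.2.2 Program Implementation
3.2.3 Assumptions and Limitations
Section 3 Evaluation of I Sampling
3.3.1 Validation Method
3.3.2 Methods Compared
3.3.3 Evaluation Metrics
3.3.4 Test Scenarios
Chapter 4 Research Results
Section 1 Experimental Data
4.1.1 Data Overview
4.1.2 Data Types
Section 2 Experimental Results
4.2.1 When Building the Predictive Model
4.2.2 When Classifying Future Incoming Data
Chapter 5 Conclusions and Suggestions
Section 1 Conclusions
Section 2 Research Contributions and Suggestions
5.2.1 Academic Contributions and Suggestions
5.2.2 Practical Contributions and Suggestions
Section 3 Research Limitations and Directions for Future Research
5.3.1 Research Limitations
5.3.2 Suggestions for Future Research
References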
dc.format.extent 1152261 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US