Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition (預測模型的遺失值處理─選值順序的研究)

Title	預測模型的遺失值處理─選值順序的研究 Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition
Author	黃秋芸 Huang, Chiu Yun
Contributors	唐揆 Tang, Kwei (advisor); 黃秋芸 Huang, Chiu Yun (author)
Keywords	Predictive Model; Missing Value; Active Feature-value Acquisition; Decision Tree
Date	2013
Upload time	7-Jul-2014 11:10:36 (UTC+8)
Abstract	Business intelligence is developing rapidly, and predictive models play a key role in many business intelligence tasks. However, when extracting hidden, previously unknown, and potentially useful information from large volumes of data, analysts often run into data quality problems that make analysis difficult. Missing values are an especially common difficulty in the data pre-processing phase, so handling them effectively when building a predictive model is an important issue. Many studies have addressed missing-value handling; among them, research on Active Feature-Value Acquisition (AFA) focuses on the order in which missing feature values in the training data should be acquired. The goal of AFA is to reduce the cost of reaching a desired model accuracy by choosing, from training data with missing values, the acquisitions that are most informative. Following the AFA concept, we propose a new acquisition-ordering method, I Sampling, in which feature values are selected for acquisition based on the attribute at the top node of the current decision tree. We train and test the method on real data and compare it with methods proposed in earlier literature under different scenarios and missing-data patterns, to examine whether the choice and order of acquisitions affect classification accuracy and to understand each method's strengths, weaknesses, and applicability. Experimental results show that, in some situations, our approach induces accurate models with substantially fewer feature-value acquisitions than alternative policies. The proposed method and findings can inform future predictive work in academia and business: practitioners can choose an acquisition method suited to the nature of their data, reducing acquisition cost while improving the classification ability of the predictive model.
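Description	Master's (碩士); National Chengchi University (國立政治大學); Graduate Institute of Business Administration (企業管理研究所); 101355006; 102
Source	http://thesis.lib.nccu.edu.tw/record/#G0101355006
Type	thesis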
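
The abstract describes I Sampling as selecting feature values for acquisition based on the attribute at the top node of the current decision tree. The block below is a minimal sketch of that idea, not the thesis's implementation. It assumes categorical features, missing values stored as None, and an ID3-style information-gain criterion for choosing the top node; none of these details are stated in this record. The names root_feature, acquire_top_node, oracle, and budget are hypothetical, and the fallback to the next-best feature once the top-node feature has no gaps left is our assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, feature):
    """Information gain of `feature`, computed only on rows where it is known."""
    known = [(row[feature], y) for row, y in zip(rows, labels) if row[feature] is not None]
    if not known:
        return 0.0
    groups = {}
    for value, y in known:
        groups.setdefault(value, []).append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder


def root_feature(rows, labels, n_features):
    """Index of the feature an ID3-style tree would place at its top node."""
    return max(range(n_features), key=lambda f: info_gain(rows, labels, f))


def acquire_top_node(rows, labels, oracle, budget):
    """Fill up to `budget` missing values in place, one per round.

    Each round re-ranks the features on the current data and queries the
    highest-gain feature that still has a missing value (assumed fallback
    rule). `oracle` is a complete copy of the data standing in for a
    costly lookup. Returns the (row, feature) pairs that were acquired.
    """
    n_features = len(rows[0])
    acquired = []
    for _ in range(budget):
        open_feats = [f for f in range(n_features) if any(r[f] is None for r in rows)]
        if not open_feats:
            break  # nothing left to acquire
        best = max(open_feats, key=lambda f: info_gain(rows, labels, f))
        i = next(k for k, r in enumerate(rows) if r[best] is None)
        rows[i][best] = oracle[i][best]
        acquired.append((i, best))
    return acquired
```

Calling acquire_top_node(rows, labels, oracle, budget=3) on a small table mutates rows in place and returns the acquisition order, which can then be compared against a random-order baseline at the same budget.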

dc.contributor.advisor	唐揆	zh_TW
dc.contributor.advisor	Tang, Kwei	en_US
dc.contributor.author (Authors)	黃秋芸	zh_TW
dc.contributor.author (Authors)	Huang, Chiu Yun	en_US
dc.creator (Author)	黃秋芸	zh_TW
dc.creator (Author)	Huang, Chiu Yun	en_US
dc.date (Date)	2013	en_US
dc.date.accessioned	7-Jul-2014 11:10:36 (UTC+8)	-
dc.date.available	7-Jul-2014 11:10:36 (UTC+8)	-
dc.date.issued (Upload time)	7-Jul-2014 11:10:36 (UTC+8)	-
dc.identifier (Other Identifiers)	G0101355006	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/67317	-
dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Graduate Institute of Business Administration (企業管理研究所)	zh_TW
dc.description (Description)	101355006	zh_TW
dc.description (Description)	102	zh_TW
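dc.description.tableofcontents	Acknowledgements; Abstract (Chinese); Abstract (English); List of Tables; List of Figures. Chapter 1 Introduction: Section 1 Research Background; Section 2 Research Motivation and Objectives; Section 3 Research Framework; Section 4 Research Results and Contributions; Section 5 Thesis Organization. Chapter 2 Literature Review: Section 1 Missing Values [2.1.1 Types of Missing Values; 2.1.2 Methods for Handling Missing Values]; Section 2 Machine Learning [2.2.1 Development and Principles; 2.2.2 AFA (Active feature-value acquisition)]. Chapter 3 Research Method: Section 1 Research Framework; Section 2 Description of I Sampling [3.2.1 Research Idea; 3.2.2 Program Implementation; 3.2.3 Assumptions and Limitations]; Section 3 Evaluation of I Sampling [3.3.1 Validation Method; 3.3.2 Baselines for Comparison; 3.3.3 Evaluation Metrics; 3.3.4 Test Scenarios]. Chapter 4 Results: Section 1 Experimental Data [4.1.1 Data Overview; 4.1.2 Data Types]; Section 2 Experimental Results [4.2.1 When Building the Predictive Model; 4.2.2 When Classifying Future Incoming Data]. Chapter 5 Conclusions and Recommendations: Section 1 Conclusions; Section 2 Contributions and Recommendations [5.2.1 Academic Contributions and Recommendations; 5.2.2 Practical Contributions and Recommendations]; Section 3 Limitations and Future Research Directions [5.3.1 Research Limitations; 5.3.2 Suggestions for Future Research]. References.	zh_TW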
dc.format.extent	1152261 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0101355006	en_US
dc.subject (Keywords)	預測模型	zh_TW
dc.subject (Keywords)	遺失值	zh_TW
dc.subject (Keywords)	Active Feature-value Acquisition	zh_TW
dc.subject (Keywords)	決策樹	zh_TW
dc.subject (Keywords)	Predictive Model	en_US
dc.subject (Keywords)	Missing Value	en_US
dc.subject (Keywords)	Active Feature-value Acquisition	en_US
dc.subject (Keywords)	Decision Tree	en_US
dc.title (Title)	預測模型的遺失值處理─選值順序的研究	zh_TW
dc.title (Title)	Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition	en_US
dc.type (Type)	thesis	en
