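Title: 預測模型的遺失值處理─選值順序的研究
Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition
Author: 黃秋芸 (Huang, Chiu Yun)
Contributors: 唐揆 (Tang, Kwei)
黃秋芸 (Huang, Chiu Yun)
Keywords: Predictive Model
Missing Value
Active Feature-value Acquisition
Decision Tree
Date: 2013
Uploaded: 7-Jul-2014 11:10:36 (UTC+8)
Abstract (translated from the Chinese): Business knowledge is advancing rapidly, and predictive models play an important role in many business intelligence applications. However, when extracting hidden, previously unknown, and potentially useful information from large volumes of data, analysts often run into data quality problems that make the analysis hard to begin. Missing values are an especially common difficulty in the data pre-processing stage. How to handle missing values effectively when building a predictive model is therefore an important issue.
Many studies have addressed missing-value handling. Among them, research on Active Feature-Value Acquisition (AFA) examines in depth the order in which missing values in the training data are selected and filled. The idea of AFA is to choose, from training data with missing values, the appropriate missing entries to fill so that the predictive model reaches the desired accuracy as efficiently as possible.
This study continues the AFA line of research. It proposes I Sampling, a new method for ordering the acquisition of missing training-data values that gives priority to the attributes at the nodes of the decision tree. The method is trained and tested on real data and compared with methods proposed in earlier literature, in order to determine whether different choices of which values to fill, and in what order, affect the classification accuracy of a predictive model, and to identify each method's strengths, weaknesses, and applicability in different scenarios.
The proposed method and the validation results offer reference and guidance for future managerial and academic work involving prediction: a value-selection approach suited to the nature of the data can be adopted, reducing acquisition cost and improving the classification ability of the predictive model.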
Abstract (English): Business intelligence is growing rapidly in importance, and predictive models play a key role in many business intelligence tasks. However, when we extract information from large volumes of data, handling missing values is a critical problem, especially in the data pre-processing phase. It is therefore important to identify which methods best deal with missing data when building predictive models.
Several papers have studied strategies for dealing with missing values. Work on Active Feature-Value Acquisition (AFA) in particular addresses the priority order in which feature values are acquired. The goal of AFA is to reduce the cost of achieving a desired model accuracy by identifying the instances for which obtaining complete information is most informative. Following the AFA concept, we present an approach, I Sampling, in which feature values are selected for acquisition based on the attribute at the top node of the current decision tree. We also compare our approach with other methods in different situations and under different missing-data patterns. Experimental results demonstrate that, in some situations, our approach can induce accurate models using substantially fewer feature-value acquisitions than alternative policies. The proposed method can aid future predictive work in both academia and business: practitioners can choose the right method for their needs and obtain informative data efficiently.
Description: Master's degree
National Chengchi University
Graduate Institute of Business Administration
101355006
102
Source: http://thesis.lib.nccu.edu.tw/record/#G0101355006
Type: thesis
Advisor: 唐揆 (Tang, Kwei)
Identifier: G0101355006
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/67317
Format: application/pdf, 1152261 bytes
Language: en_US
Table of contents:
Acknowledgements
Abstract (Chinese)
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
  Section 1 Research Background
  Section 2 Research Motivation and Objectives
  Section 3 Research Framework
  Section 4 Research Results and Contributions
  Section 5 Thesis Organization
Chapter 2 Literature Review
  Section 1 Missing Values
    2.1.1 Types of Missing Values
    2.1.2 Methods for Handling Missing Values
  Section 2 Machine Learning
    2.2.1 Development and Principles
    2.2.2 AFA (Active feature-value acquisition)
Chapter 3 Research Method
  Section 1 Research Framework
  Section 2 Description of I Sampling
    3.2.1 Research Idea
    3.2.2 Program Implementation
    3.2.3 Assumptions and Limitations
  Section 3 Evaluation of I Sampling
    3.3.1 Validation Method
    3.3.2 Methods Compared
    3.3.3 Evaluation Metrics
    3.3.4 Test Scenarios
Chapter 4 Results
  Section 1 Experimental Data
    4.1.1 Data Overview
    4.1.2 Data Types
  Section 2 Experimental Results
    4.2.1 When Building the Predictive Model
    4.2.2 When Classifying Future Incoming Data
Chapter 5 Conclusions and Recommendations
  Section 1 Conclusions
  Section 2 Research Contributions and Recommendations
    5.2.1 Academic Contributions and Recommendations
    5.2.2 Practical Contributions and Recommendations
  Section 3 Research Limitations and Future Research Directions
    5.3.1 Research Limitations
    5.3.2 Suggestions for Future Research
References
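Illustrative sketch: the abstract describes acquiring missing feature values according to the attribute at the top node of the current decision tree. The record does not give the thesis's actual procedure, so the Python below is only a rough sketch under stated assumptions, not the author's I Sampling implementation. It assumes categorical attributes, uses information gain computed over the known values as a stand-in for how a greedy tree learner picks its top node, and falls back to the next-best attribute once the top one has no missing values left. The names `acquire_by_root`, `root_attribute`, and `oracle` are hypothetical.

```python
import math
from collections import Counter

MISSING = None  # marker for a feature value that has not been acquired yet


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Gain of splitting on `attr`, using only the rows where `attr` is known."""
    known = [(row[attr], y) for row, y in zip(rows, labels) if row[attr] is not MISSING]
    if not known:
        return 0.0
    groups = {}
    for value, y in known:
        groups.setdefault(value, []).append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder


def root_attribute(rows, labels, candidates):
    """Attribute a greedy, gain-based tree learner would place at the top node."""
    return max(candidates, key=lambda a: information_gain(rows, labels, a))


def acquire_by_root(rows, labels, oracle, budget):
    """Fill up to `budget` missing cells, always from the current top attribute.

    `oracle[(i, a)]` is a hypothetical stand-in for a costly lookup of the
    true value of row i, attribute a. Returns the cells in acquisition order.
    """
    acquired = []
    n_attrs = len(rows[0])
    while len(acquired) < budget:
        # Assumption: only attributes that still have missing cells compete.
        open_attrs = [a for a in range(n_attrs) if any(r[a] is MISSING for r in rows)]
        if not open_attrs:
            break
        top = root_attribute(rows, labels, open_attrs)  # re-ranked after every fill
        i = next(i for i, r in enumerate(rows) if r[top] is MISSING)
        rows[i][top] = oracle[(i, top)]
        acquired.append((i, top))
    return acquired


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rows = [["a", "x"], ["a", MISSING], ["b", "x"], [MISSING, "y"]]
    labels = ["yes", "yes", "no", "no"]
    oracle = {(1, 1): "y", (3, 0): "b"}
    print(acquire_by_root(rows, labels, oracle, budget=2))  # [(3, 0), (1, 1)]
```

In the toy data, attribute 0 separates the classes better than attribute 1 on the known values, so its missing cell is acquired first. Re-ranking after every acquisition is one way to read "the current decision tree"; a batch variant would re-rank less often.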