Title: 利用決策樹插補遺失值之模擬研究 (Missing Data Imputation with Classification and Regression Trees: A Simulation Study)
Author: Chen, Jheng-Yang (陳政揚)
Advisor: Chang, Yu-Wei (張育瑋)
Keywords: CART; decision trees; iterative imputation; missing data imputation
Date: 2022
Uploaded: 1-Aug-2022 17:16:33 (UTC+8)

Abstract
Dealing with missing values is a common preprocessing issue in data analysis. A popular approach is to impute the missing values so that a complete data set is available for subsequent analysis. The current study continues the line of research on imputing missing data with decision trees: we modify several methods from the literature and compare their imputation performance. In addition to the classical CART algorithm, chi-square tests are used to select the split variable. Unlike the DMI method proposed by Rahman and Islam (2013), the training data set is not limited to observations without any missing values; any observation whose response variable is observed can be used for training, which makes fuller use of the observed data. Moreover, when a decision tree is built on a data set with missing values, one encounters observations whose split-variable value is missing. In addition to the approach in the literature of passing such observations down the tree using the mean or mode, the current study proposes two resampling methods for passing them down. Lastly, we combine iterative imputation methods from the literature with decision trees: for a given data set, each variable with missing values is imputed iteratively until convergence. This avoids the pass-down problem of tree-based imputation and, hopefully, exploits the relationships between variables more effectively. We compare all the methods in simulation studies and also apply them to two real data sets: the Hepatitis Data Set and the Credit Approval Data Set.

References

Batista, G. E. A. P. A., and Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17, 519–533.
Beaulac, C., and Rosenthal, J. S. (2020). BEST: a decision tree algorithm that handles missing values. Computational Statistics, 35, 1001–1026.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
Fazakis, N., Kostopoulos, G., Kotsiantis, S., and Mporas, I. (2020). Iterative robust semi-supervised missing data imputation. IEEE Access, 8, 90555–90569.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.
Kim, H., and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589–604.
Little, R. J. A., and Rubin, D. B. (2020). Statistical Analysis with Missing Data (3rd ed.). Hoboken, NJ: Wiley.
Loh, W.-Y., and Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815–840.
Luengo, J., García, S., and Herrera, F. (2012). On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems, 32, 77–108.
Merz, C., and Murphy, P. (1996). UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine. (http://www.ics.uci.edu/mlearn/MLRepository.html).
Nikfalazar, S., Yeh, C. H., Bedingfield, S., and Khorshidi, H. A. (2020). Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowledge and Information Systems, 62, 2419–2437.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Rahman, M. G., and Islam, M. Z. (2013). Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. Knowledge-Based Systems, 53, 51–65.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Stekhoven, D. J., and Bühlmann, P. (2011). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112–118.
van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of Translational Medicine, 4, 9.

Description: Master's thesis
National Chengchi University
Department of Statistics
Student ID: 109354019
Source: http://thesis.lib.nccu.edu.tw/record/#G0109354019
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/141011
DOI: 10.6814/NCCU202200953
Type: thesis

Table of Contents
Chapter 1 Introduction
Chapter 2 Background
2.1 Classification and Regression Trees
2.2 Missing Data Mechanisms
Chapter 3 Methodology
3.1 Building Decision Trees in the Presence of Missing Values
3.2 Pass-down Methods for Training Observations with Missing Split-Variable Values
3.3 Imputing Missing Values in the Original Data with Methods 1 to 7
3.4 Iterative Imputation
Chapter 4 Simulation Studies
4.1 Simulation Settings
4.2 Simulation Results
Chapter 5 Data Analysis
5.1 Hepatitis Data
5.2 Credit Approval Data
Chapter 6 Conclusion and Discussion
References
Appendix 1 RMSE (or Accuracy) of the Eight Imputation Methods for Each of the Five Missing Variables
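The abstract describes an iterative, tree-based imputation scheme: each variable with missing values is imputed in turn by a tree trained on rows where that variable is observed, cycling until the imputations converge. The sketch below is a minimal illustration of that general idea, not the thesis's actual implementation: it assumes numeric-only data, uses scikit-learn's `DecisionTreeRegressor` with a max-change stopping rule, and omits the thesis's chi-square split selection, resampling pass-down rules, and methods 1 to 7. The function name `iterative_tree_impute` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def iterative_tree_impute(X, max_iter=10, tol=1e-4):
    """Iteratively impute NaN entries of a numeric matrix, one
    regression tree per column (missForest-style sketch)."""
    X = X.astype(float).copy()
    miss = np.isnan(X)                       # remember where the holes are
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):              # crude start: column means
        X[miss[:, j], j] = col_means[j]
    for _ in range(max_iter):
        X_old = X.copy()
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]                # rows where column j was observed
            other = np.delete(np.arange(X.shape[1]), j)
            tree = DecisionTreeRegressor(max_depth=4, random_state=0)
            tree.fit(X[obs][:, other], X[obs, j])
            X[miss[:, j], j] = tree.predict(X[miss[:, j]][:, other])
        if np.max(np.abs(X - X_old)) < tol:  # stop once imputations stabilize
            break
    return X
```

Because every column is retrained against the current imputations of all the others, relationships between variables feed back into each round, which is the efficiency gain the abstract hopes for; only originally missing cells are ever overwritten, so observed values are untouched.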