Title 利用決策樹插補遺失值之模擬研究
Missing Data Imputation with Classification and Regression Trees: A Simulation Study
Author 陳政揚
Chen, Jheng-Yang
Contributors 張育瑋
Chang, Yu-Wei
陳政揚
Chen, Jheng-Yang
Keywords CART
Decision trees
Iterative imputation
Missing data imputation
Date 2022
Upload time 1-Aug-2022 17:16:33 (UTC+8)
Abstract Handling missing values is a common issue in data preprocessing. A popular approach is to impute the missing values so that a complete data set is available for subsequent analysis. This study extends earlier work on imputing missing data with decision trees: it modifies several methods from the literature and compares their imputation performance. In addition to the classic CART algorithm, chi-square tests are used to select the split variable. Unlike the DMI method proposed by Rahman and Islam (2013), the training data are not restricted to observations with no missing values; any observation whose value of the variable to be imputed is observed can be used for training, which makes fuller use of the observed data. When a decision tree is built from data with missing values, the split variable itself may be missing for some observations. Besides the approach in the literature, which substitutes the mean or mode so that such observations can pass down the tree, this study considers two resampling methods. Finally, iterative imputation methods from the literature are combined with decision trees: for a given data set, each variable with missing values is imputed iteratively until convergence. This avoids the pass-through problem that arises in existing tree-based imputation and is intended to exploit the relationships among variables more effectively. The methods are compared in simulation studies and applied to two real data sets, the Hepatitis Data Set and the Credit Approval Data Set.
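The iterative scheme described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`fit_tree`, `predict`, `iterative_tree_impute`, `demo_data`), the mean-based starting values, the depth and leaf-size limits, and the stopping tolerance are all assumptions chosen for illustration, and only numeric variables with regression trees are handled. The sketch does follow two ideas stated in the abstract: every observation whose target variable is observed is used for training, and each variable with missing values is re-imputed in turn until the imputed values stop changing.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def fit_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Fit a small CART-style regression tree; returns a nested dict."""
    node = {"value": sum(y) / len(y)}
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return node
    best = None
    for j in range(len(X[0])):
        cuts = sorted(set(row[j] for row in X))
        for lo, hi in zip(cuts, cuts[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if min(len(left), len(right)) < min_leaf:
                continue
            score = sse(left) + sse(right)  # smaller means a purer split
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:
        return node
    _, j, t = best
    lo_idx = [i for i, row in enumerate(X) if row[j] <= t]
    hi_idx = [i for i, row in enumerate(X) if row[j] > t]
    node["var"], node["cut"] = j, t
    node["left"] = fit_tree([X[i] for i in lo_idx], [y[i] for i in lo_idx],
                            depth + 1, max_depth, min_leaf)
    node["right"] = fit_tree([X[i] for i in hi_idx], [y[i] for i in hi_idx],
                             depth + 1, max_depth, min_leaf)
    return node

def predict(node, row):
    """Route one row down the tree and return the leaf mean."""
    while "var" in node:
        node = node["left"] if row[node["var"]] <= node["cut"] else node["right"]
    return node["value"]

def iterative_tree_impute(data, max_iter=10, tol=1e-6):
    """Impute None entries; assumes every column has some observed values."""
    n, p = len(data), len(data[0])
    miss = [[v is None for v in row] for row in data]
    means = []
    for j in range(p):
        obs = [row[j] for row in data if row[j] is not None]
        means.append(sum(obs) / len(obs))
    # Assumed starting point: fill each gap with its column mean.
    filled = [[means[j] if miss[i][j] else data[i][j] for j in range(p)]
              for i in range(n)]
    for _ in range(max_iter):
        change = 0.0
        for j in range(p):
            todo = [i for i in range(n) if miss[i][j]]
            if not todo:
                continue
            # Train on every row whose target variable j is observed.
            train = [i for i in range(n) if not miss[i][j]]
            others = [k for k in range(p) if k != j]
            tree = fit_tree([[filled[i][k] for k in others] for i in train],
                            [filled[i][j] for i in train])
            for i in todo:
                new = predict(tree, [filled[i][k] for k in others])
                change = max(change, abs(new - filled[i][j]))
                filled[i][j] = new
        if change < tol:  # imputed values have stabilised
            break
    return filled

def demo_data(n=150, seed=0):
    """Hypothetical toy data: three correlated columns, two with gaps."""
    rng = random.Random(seed)
    full = []
    for _ in range(n):
        x = rng.random()
        full.append([x, 2 * x + rng.gauss(0, 0.05), -x + rng.gauss(0, 0.05)])
    holed = [row[:] for row in full]
    for i in range(30):
        holed[i][1] = None
    for i in range(30, 60):
        holed[i][2] = None
    return full, holed
```

Calling `full, holed = demo_data()` followed by `iterative_tree_impute(holed)` returns a completed copy of the data in which observed entries are left untouched. Because every gap is filled before any tree is fitted, no observation reaches a split with a missing split-variable value, which is the pass-through problem the abstract says the iterative approach avoids.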
References Beaulac, C., and Rosenthal, J. S. (2020). BEST: a decision tree algorithm that handles missing values. Computational Statistics, 35, 1001–1026.
Batista, G. E. A. P. A., and Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17, 519–533.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regression trees. Belmont, Calif. : Wadsworth International Group.
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York : Chapman & Hall.
Fazakis, N., Kostopoulos, G., Kotsiantis, S., and Mporas, I. (2020). Iterative robust semi-supervised missing data imputation. IEEE Access, 8, 90555–90569.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning : with Applications in R. New York : Springer.
Kim, H., and Loh, W.-Y. (2001). Classification Trees with Unbiased Multiway Splits. Journal of the American Statistical Association, 96, 589–604.
Luengo, J., García, S., and Herrera, F. (2012). On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems, 32, 77–108.
Little, R. J. A., and Rubin, D. B. (2020). Statistical analysis with missing data (3rd ed.). Hoboken, NJ : Wiley.
Loh, W.-Y., and Shih, Y.-S. (1997). Split Selection Methods for Classification Trees. Statistica Sinica, 7, 815–840.
Merz, C., and Murphy, P. (1996). UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine. (http://www.ics.uci.edu/mlearn/MLRepository.html).
Nikfalazar, S., Yeh, C. H., Bedingfield, S., and Khorshidi, H. A. (2020). Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowledge and Information Systems, 62, 2419–2437.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA : Morgan Kaufmann.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rahman, M. G., and Islam, M. Z. (2013). Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. Knowledge-Based Systems, 53, 51–65.
Stekhoven, D. J., and Bühlmann, P. (2011). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112–118.
van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of Translational Medicine, 4, 9.
Description Master's
National Chengchi University
Department of Statistics
109354019
Source http://thesis.lib.nccu.edu.tw/record/#G0109354019
Type thesis
dc.contributor.advisor 張育瑋 (zh_TW)
dc.contributor.advisor Chang, Yu-Wei (en_US)
dc.contributor.author (Authors) 陳政揚 (zh_TW)
dc.contributor.author (Authors) Chen, Jheng-Yang (en_US)
dc.creator (Author) 陳政揚 (zh_TW)
dc.creator (Author) Chen, Jheng-Yang (en_US)
dc.date (Date) 2022 (en_US)
dc.date.accessioned 1-Aug-2022 17:16:33 (UTC+8)
dc.date.available 1-Aug-2022 17:16:33 (UTC+8)
dc.date.issued (Upload time) 1-Aug-2022 17:16:33 (UTC+8)
dc.identifier (Other Identifiers) G0109354019 (en_US)
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141011
dc.description (Description) Master's (zh_TW)
dc.description (Description) National Chengchi University (zh_TW)
dc.description (Description) Department of Statistics (zh_TW)
dc.description (Description) 109354019 (zh_TW)
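dc.description.tableofcontents Chapter 1 Introduction
Chapter 2 Background
2.1 Classification and Regression Trees
2.2 Missing-Data Mechanisms
Chapter 3 Methods
3.1 Building Decision Trees in the Presence of Missing Values
3.2 Passing Training Observations with a Missing Split-Variable Value Down the Tree
3.3 Imputing Missing Values in the Original Data with Methods 1 to 7
3.4 Iterative Imputation
Chapter 4 Simulation Study
4.1 Simulation Settings
4.2 Simulation Results
Chapter 5 Data Analysis
5.1 Hepatitis Data
5.2 Credit Approval Data
Chapter 6 Conclusions and Discussion
References
Appendix 1 RMSE (or Accuracy) of the Eight Imputation Methods for Each of the Five Variables with Missing Values
(zh_TW)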
dc.format.extent 7483469 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109354019 (en_US)
dc.subject (Keywords) CART (en_US)
dc.subject (Keywords) Decision trees (en_US)
dc.subject (Keywords) Iterative imputation (en_US)
dc.subject (Keywords) Missing data imputation (en_US)
dc.title (Title) 利用決策樹插補遺失值之模擬研究 (zh_TW)
dc.title (Title) Missing Data Imputation with Classification and Regression Trees: A Simulation Study (en_US)
dc.type (Type) thesis (en_US)
dc.identifier.doi (DOI) 10.6814/NCCU202200953 (en_US)