Title An Evaluation of Correlation-Based Categorical Feature Selection Methods (基於相關性的類別特徵選擇方法之評估)
Author Chang, Chih-Chun (張智鈞)
Advisor Chou, Pei-Ting (周珮婷)
Keywords Feature selection
Dimension reduction
Variable association
Filter method
Entropy
Categorical datasets
Date 2022
Uploaded 1-Jul-2022 16:58:28 (UTC+8)
Abstract With the rapid development of machine learning, the importance of feature selection is self-evident. Selecting features appropriately can improve the predictive performance of statistical models, reduce computational cost, and help analysts better understand what the data convey. Feature selection methods are mainly divided into filter, wrapper, and embedded methods; this study focuses on the filter method. It implements the filter method using several indices that measure the association between variables, such as the Pearson correlation coefficient, conditional entropy, cross-entropy, relative entropy, Goodman and Kruskal's τ, and Cramér's V, examines the predictive performance obtained on each dataset under each index, and compares it with the performance on the original datasets. Ten datasets were used in the experiments: two simulated and eight real, most of them categorical.
On the simulated data, conditional entropy selected important variables better than the other indices when the variables were categorical. On the real data, some datasets retained good predictive performance after filtering, while others did not; the poor cases appear to be related to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. The first three problems may be mitigated by merging categories appropriately, and continuous variables can be discretized according to the distribution of the original data.
Future work should focus on how to set the selection threshold for categorical data, and on whether the filter method can be combined with wrapper and embedded methods into new algorithms that select important variables more precisely and make data analysis more efficient.
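The record itself contains no code, but the filter procedure the abstract describes is straightforward to illustrate. Below is a minimal Python sketch, assuming pandas and numpy are available; the formulas are the standard definitions of conditional entropy, Cramér's V, and Goodman and Kruskal's τ, while the function names, toy data, and the choice of k are illustrative assumptions rather than the thesis's actual pipeline.

```python
# A minimal sketch of the filter approach described above, assuming pandas and
# numpy. Formulas follow standard definitions; names and data are illustrative.
import numpy as np
import pandas as pd


def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    """H(y | x): expected entropy of the target y within each category of x.

    Lower values mean x carries more information about y.
    """
    joint = pd.crosstab(x, y, normalize=True).values   # P(x, y)
    px = joint.sum(axis=1, keepdims=True)              # P(x)
    cond = joint / px                                  # P(y | x)
    mask = joint > 0                                   # skip empty cells
    return float(-(joint[mask] * np.log(cond[mask])).sum())


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: chi-squared-based association in [0, 1]; higher is stronger."""
    table = pd.crosstab(x, y).values
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))


def goodman_kruskal_tau(x: pd.Series, y: pd.Series) -> float:
    """Goodman and Kruskal's tau: proportional reduction in the error of
    predicting y once x is known, in [0, 1]; higher is stronger."""
    joint = pd.crosstab(x, y, normalize=True).values   # P(x, y)
    px = joint.sum(axis=1, keepdims=True)              # P(x)
    py = joint.sum(axis=0)                             # P(y)
    err_total = 1.0 - np.sum(py ** 2)                  # error ignoring x
    err_cond = 1.0 - np.sum(joint ** 2 / px)           # error knowing x
    return float((err_total - err_cond) / err_total)


def filter_select(df: pd.DataFrame, target: str, score=cramers_v, k: int = 5):
    """Rank every feature by its association with the target and keep the top k.

    Assumes higher scores mean stronger association (Cramér's V, tau); for
    conditional entropy, where lower is better, sort ascending instead.
    """
    scores = {c: score(df[c], df[target]) for c in df.columns if c != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy check: one feature agrees with the target 80% of the time, the other
    # is pure noise, so the filter should keep "informative".
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    df = pd.DataFrame({
        "informative": np.where(rng.random(500) < 0.8, y, 1 - y),
        "noise": rng.integers(0, 3, size=500),
        "target": y,
    })
    print(filter_select(df, target="target", k=1))  # typically ['informative']
```

Cramér's V and τ grow with association, so the sketch ranks scores in descending order, while conditional entropy shrinks as association grows, hence the flipped sort noted in filter_select. The thesis's open question of how to set the cut-off for categorical data is left here as the parameter k.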
References Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711-720.
Beh, E. J., & Davy, P. J. (1998). Theory & Methods: Partitioning Pearson’s Chi‐Squared Statistic for a Completely Ordered Three‐Way Contingency Table. Australian & New Zealand Journal of Statistics, 40(4), 465-477.
Boltz, S., Debreuve, E., & Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07) (pp. 16-16). IEEE.
Boltz, S., Debreuve, E., & Barlaud, M. (2009). High-dimensional statistical measure for region-of-interest tracking. IEEE Transactions on Image Processing, 18(6), 1266-1283.
Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., & Lang, M. (2020). Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis, 143, 106839.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Cover, T. M., & Thomas, J. A. (1991). Entropy, relative entropy and mutual information. In Elements of Information Theory (pp. 12-13). Wiley.
Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference (FUBUTEC 2008) (pp. 5-12).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-232.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
D'Ambra, L., & Lauro, N. (1989). Non symmetrical analysis of three-way contingency tables. In Multiway Data Analysis (pp. 301-315).
D’Ambra, L., Beh, E. J., & Lombardo, R. (2005). Decomposing Goodman-Kruskal tau for Ordinal Categorical Variables. International Statistical Institute, 55th.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–769.
Gruosso, T., Mieulet, V., Cardon, M., Bourachot, B., Kieffer, Y., Devun, F., ... & Mechta-Grigoriou, F. (2016). Chronic oxidative stress promotes H2AX protein degradation and enhances chemosensitivity in breast cancer patients. EMBO Molecular Medicine, 8(5), 527-549.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (Eds.). (2008). Feature extraction: foundations and applications (Vol. 207). Springer.
Hull, J. J. (1994). A database for handwritten text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550-554.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.
Masoudi-Sobhanzadeh, Y., Motieghader, H., & Masoudi-Nejad, A. (2019). FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics, 20(1), 1-17.
National Development Council (2020). 2018 Mobile Phone Users' Digital Opportunity Survey (AE080006) [data file]. Available from Survey Research Data Archive, Academia Sinica. doi:10.6141/TW-SRDA-AE080006-1
Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
Remeseiro, B., & Bolon-Canedo, V. (2019). A review of feature selection methods in medical applications. Computers in Biology and Medicine, 112, 103375.
Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Science of the Total Environment, 624, 661-672.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.
Sun, Y., Lu, C., & Li, X. (2018). The cross-entropy based multi-filter ensemble method for gene selection. Genes, 9(5), 258.
Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S., & Fong, S. (2018). Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy. Pertanika Journal of Science & Technology, 26(1).
Wang, J., Xu, J., Zhao, C., Peng, Y., & Wang, H. (2019). An ensemble feature selection method for high-dimensional data based on sort aggregation. Systems Science & Control Engineering, 7(2), 32-39.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML (Vol. 97, pp. 412-420).
Yöntem, M. K., Adem, K., İlhan, T., & Kılıçarslan, S. (2019). Divorce prediction using correlation-based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Dergisi, 9(1), 259-273.
Description Master's thesis
National Chengchi University
Department of Statistics
109354026
Source http://thesis.lib.nccu.edu.tw/record/#G0109354026
URI http://nccur.lib.nccu.edu.tw/handle/140.119/140755
DOI 10.6814/NCCU202200500
Type thesis
Table of Contents Chapter 1: Introduction
  1.1 The Current State of Feature Selection
  1.2 Research Motivation and Objectives
Chapter 2: Literature Review
Chapter 3: Research Methods and Data Description
  3.1 Indices Used
  3.2 Algorithms Used
  3.3 Data Description
Chapter 4: Research Process and Discussion of Results
  4.1 Experimental Process and Results
  4.2 Discussion of Results
Chapter 5: Conclusions and Suggestions
References
Appendix