Title An Evaluation of Correlation-Based Categorical Feature Selection Methods (基於相關性的類別特徵選擇方法之評估)
Author Chang, Chih-Chun (張智鈞)
Advisor Chou, Pei-Ting (周珮婷)
Keywords Feature selection
Dimension reduction
Variable association
Filter method
Entropy
Categorical datasets
Date 2022
Uploaded 1-Jul-2022 16:58:28 (UTC+8)
Abstract With the rapid development of machine learning, the importance of feature selection is self-evident. Selecting features appropriately can improve the predictive performance of statistical models, reduce computational cost, and help analysts better understand what the data convey. Feature selection methods are mainly divided into filter, wrapper, and embedded methods; this study focuses on the filter method. It implements the filter method using several indices that measure the association between variables, such as the Pearson correlation coefficient, conditional entropy, cross-entropy, relative entropy, Goodman and Kruskal's τ, and Cramér's V, examines the predictive performance obtained on each dataset under each index, and compares it with the performance on the original datasets. Ten datasets were used in the experiments: two simulated and eight real, most of them categorical.
On the simulated data, conditional entropy selected important variables better than the other indices when the variables were categorical. On the real data, some datasets retained good predictive performance after filtering, while others did not; the poor cases appear to be related to explanatory variables with too many categories, too few observations, class imbalance, and improper discretization of continuous variables. The first three problems may be mitigated by merging categories appropriately, and continuous variables can be discretized according to the distribution of the original data.
Future work should focus on how to set the selection threshold for categorical data, and on whether the filter method can be combined with wrapper and embedded methods into new algorithms that select important variables more precisely and make data analysis more efficient.
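The record itself contains no code, but the filter procedure the abstract describes is straightforward to illustrate. Below is a minimal Python sketch, assuming pandas and numpy are available; the formulas are the standard definitions of conditional entropy, Cramér's V, and Goodman and Kruskal's τ, while the function names, toy data, and the choice of k are illustrative assumptions rather than the thesis's actual pipeline.

```python
# A minimal sketch of the filter approach described above, assuming pandas and
# numpy. Formulas follow standard definitions; names and data are illustrative.
import numpy as np
import pandas as pd


def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    """H(y | x): expected entropy of the target y within each category of x.

    Lower values mean x carries more information about y.
    """
    joint = pd.crosstab(x, y, normalize=True).values   # P(x, y)
    px = joint.sum(axis=1, keepdims=True)              # P(x)
    cond = joint / px                                  # P(y | x)
    mask = joint > 0                                   # skip empty cells
    return float(-(joint[mask] * np.log(cond[mask])).sum())


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: chi-squared-based association in [0, 1]; higher is stronger."""
    table = pd.crosstab(x, y).values
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))


def goodman_kruskal_tau(x: pd.Series, y: pd.Series) -> float:
    """Goodman and Kruskal's tau: proportional reduction in the error of
    predicting y once x is known, in [0, 1]; higher is stronger."""
    joint = pd.crosstab(x, y, normalize=True).values   # P(x, y)
    px = joint.sum(axis=1, keepdims=True)              # P(x)
    py = joint.sum(axis=0)                             # P(y)
    err_total = 1.0 - np.sum(py ** 2)                  # error ignoring x
    err_cond = 1.0 - np.sum(joint ** 2 / px)           # error knowing x
    return float((err_total - err_cond) / err_total)


def filter_select(df: pd.DataFrame, target: str, score=cramers_v, k: int = 5):
    """Rank every feature by its association with the target and keep the top k.

    Assumes higher scores mean stronger association (Cramér's V, tau); for
    conditional entropy, where lower is better, sort ascending instead.
    """
    scores = {c: score(df[c], df[target]) for c in df.columns if c != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy check: one feature agrees with the target 80% of the time, the other
    # is pure noise, so the filter should keep "informative".
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    df = pd.DataFrame({
        "informative": np.where(rng.random(500) < 0.8, y, 1 - y),
        "noise": rng.integers(0, 3, size=500),
        "target": y,
    })
    print(filter_select(df, target="target", k=1))  # typically ['informative']
```

Cramér's V and τ grow with association, so the sketch ranks scores in descending order, while conditional entropy shrinks as association grows, hence the flipped sort noted in filter_select. The thesis's open question of how to set the cut-off for categorical data is left here as the parameter k.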
References Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711-720.
Beh, E. J., & Davy, P. J. (1998). Theory & Methods: Partitioning Pearson’s Chi‐Squared Statistic for a Completely Ordered Three‐Way Contingency Table. Australian & New Zealand Journal of Statistics, 40(4), 465-477.
Boltz, S., Debreuve, E., & Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07) (pp. 16-16). IEEE.
Boltz, S., Debreuve, E., & Barlaud, M. (2009). High-dimensional statistical measure for region-of-interest tracking. IEEE Transactions on Image Processing, 18(6), 1266-1283.
Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., & Lang, M. (2020). Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis, 143, 106839.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Cover, T. M., & Thomas, J. A. (1991). Entropy, relative entropy and mutual information. In Elements of Information Theory (pp. 12-13). Wiley.
Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference (FUBUTEC 2008) (pp. 5-12).
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-232.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
D'Ambra, L., & Lauro, N. (1989). Non symmetrical analysis of three-way contingency tables. In Multiway Data Analysis (pp. 301-315).
D’Ambra, L., Beh, E. J., & Lombardo, R. (2005). Decomposing Goodman-Kruskal tau for Ordinal Categorical Variables. International Statistical Institute, 55th.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–769.
Gruosso, T., Mieulet, V., Cardon, M., Bourachot, B., Kieffer, Y., Devun, F., ... & Mechta-Grigoriou, F. (2016). Chronic oxidative stress promotes H2AX protein degradation and enhances chemosensitivity in breast cancer patients. EMBO Molecular Medicine, 8(5), 527-549.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (Eds.). (2008). Feature extraction: foundations and applications (Vol. 207). Springer.
Hull, J. J. (1994). A database for handwritten text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550-554.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.
Masoudi-Sobhanzadeh, Y., Motieghader, H., & Masoudi-Nejad, A. (2019). FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics, 20(1), 1-17.
National Development Council (2020). 2018 Mobile Phone Users' Digital Opportunity Survey (AE080006) [data file]. Available from Survey Research Data Archive, Academia Sinica. doi:10.6141/TW-SRDA-AE080006-1
Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
Remeseiro, B., & Bolon-Canedo, V. (2019). A review of feature selection methods in medical applications. Computers in Biology and Medicine, 112, 103375.
Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Science of the Total Environment, 624, 661-672.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.
Sun, Y., Lu, C., & Li, X. (2018). The cross-entropy based multi-filter ensemble method for gene selection. Genes, 9(5), 258.
Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S., & Fong, S. (2018). Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy. Pertanika Journal of Science & Technology, 26(1).
Wang, J., Xu, J., Zhao, C., Peng, Y., & Wang, H. (2019). An ensemble feature selection method for high-dimensional data based on sort aggregation. Systems Science & Control Engineering, 7(2), 32-39.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML (Vol. 97, pp. 412-420).
Yöntem, M. K., Adem, K., İlhan, T., & Kılıçarslan, S. (2019). Divorce prediction using correlation-based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Dergisi, 9(1), 259-273.
Description Master's thesis
National Chengchi University
Department of Statistics
109354026
Source http://thesis.lib.nccu.edu.tw/record/#G0109354026
URI http://nccur.lib.nccu.edu.tw/handle/140.119/140755
DOI 10.6814/NCCU202200500
Type thesis
Table of Contents Chapter 1: Introduction
  1.1 The Current State of Feature Selection
  1.2 Research Motivation and Objectives
Chapter 2: Literature Review
Chapter 3: Research Methods and Data Description
  3.1 Indices Used
  3.2 Algorithms Used
  3.3 Data Description
Chapter 4: Research Process and Discussion of Results
  4.1 Experimental Process and Results
  4.2 Discussion of Results
Chapter 5: Conclusions and Suggestions
References
Appendix