Academic Output - Theses

Title: Feature Selection Using Factor-Forest and K-Means (基於K-Means和因素森林的特徵選取法)
Author: Chen, Shao-Yan (陳劭晏)
Advisors: Elizabeth Chou (周珮婷); Yu-Wei Chang (張育瑋)
Keywords: Feature selection; Dimension reduction; Clustering; Factor-Forest-K-Means
Date: 2023
Uploaded: 2-Aug-2023 13:04:02 (UTC+8)
Abstract: Feature selection is a critical step in data analysis, used to identify important variables in large and complex datasets. Many recent studies have shown that K-Means can be used to find a subset of variables that improves the performance of machine learning models, and Factor Forest (Goretzko & Bühner, 2020) has been proposed to determine the appropriate number of latent factors in data. This research introduces a new feature selection method, Factor-Forest-K-Means (FFKM), which uses the Factor Forest estimate as an index and K-Means clustering to screen variables. FFKM not only reduces dimensionality by approximately 90% but also preserves the predictive accuracy of the original model. The method is easy to implement and outperforms the other index methods and models tested in this study, achieving the highest accuracy retention, reduction percentage, and accuracy retention per variable among all methods across different settings. The results show that FFKM is a promising dimension-reduction method that can enhance the performance of machine learning models.
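The abstract outlines the FFKM pipeline: estimate the number of latent factors, cluster the variables with K-Means using that number, and keep a reduced subset. Below is a minimal Python sketch of such a pipeline, illustrative only and not the thesis implementation: the thesis uses Factor Forest (Goretzko & Bühner, 2020) to choose the number of clusters, for which the Kaiser eigenvalue rule stands in here, and both the function name `ffkm_like_selection` and the pick-nearest-to-centroid selection rule are assumptions of this sketch.

```python
# A minimal FFKM-style sketch (assumptions noted; not the thesis code).
import numpy as np
from sklearn.cluster import KMeans

def ffkm_like_selection(X, k=None, random_state=0):
    """Return sorted column indices of selected features from X (n x p)."""
    corr = np.corrcoef(X, rowvar=False)  # p x p variable correlation matrix
    if k is None:
        # Stand-in for the thesis's Factor Forest estimate of the number of
        # latent factors: the Kaiser rule (eigenvalues > 1).
        k = max(1, int((np.linalg.eigvalsh(corr) > 1.0).sum()))

    # Cluster the variables, each represented by its row of correlations,
    # so strongly correlated variables tend to fall in the same cluster.
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(corr)

    # Assumed selection rule: keep the variable nearest each centroid.
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(corr[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return sorted(selected)
```

On the reported metrics, a natural reading (not confirmed by this record) is: accuracy retention = accuracy on the selected subset divided by accuracy on all features; reduction percentage = 1 - (number of selected variables / total variables); accuracy retention per variable = accuracy retention divided by the number of selected variables.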
References:
Braeken, J., & Van Assen, M. A. (2017). An empirical Kaiser criterion. Psychological Methods, 22(3), 450.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
Gini, C. (1921). Measurement of inequality of incomes. The Economic Journal, 31(121), 124–125.
Goretzko, D., & Bühner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods, 25(6), 776.
Goretzko, D., & Bühner, M. (2022). Factor retention using machine learning with ordinal data. Applied Psychological Measurement, 46(5), 406–421.
Hartigan, J. (1975). Clustering algorithms. Wiley. Retrieved from https://books.google.com.tw/books?id=cDnvAAAAMAAJ
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151.
Khaleel, S. (2011). Feature selection using K-Means clustering for data mining.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. (Originally a 1957 Bell Telephone Laboratories paper.)
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297).
Parida, K., Mandal, S., Das, S., & Tripathy, A. (2011). Feature extraction using K-Means clustering: An approach implementation.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24(2), 282.
Tang, X., Dong, M., Bi, S., Pei, M., Cao, D., Xie, C., & Chi, S. (2017). Feature selection algorithm based on K-Means clustering. In 2017 IEEE 7th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER) (pp. 1522–1527).
Thomas, J., Coors, S., & Bischl, B. (2018). Automatic gradient boosting. arXiv preprint arXiv:1807.03873.
Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267–276.
Description: Master's thesis
Institution: National Chengchi University
Department: Department of Statistics
Student ID: 110354012
Source: http://thesis.lib.nccu.edu.tw/record/#G0110354012
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/146306
Type: thesis
Table of Contents:
  Acknowledgements (Chinese and English)
  Abstract (Chinese and English)
  List of Figures
  List of Tables
  Chapter 1: Introduction
    1.1 Literature Review
    1.2 Aim of the Study
  Chapter 2: Methods
    2.1 K-Means
    2.2 Factor Forest
    2.3 Factor Forest K-Means (FFKM)
  Chapter 3: Experiments
    3.1 Dataset
    3.2 Preprocessing and Modeling
    3.3 Feature Table
    3.4 Evaluation
  Chapter 4: Results
  Chapter 5: Conclusions
    5.1 Future Work
  References