政大圖書館 (NCCU Library)
Title: 以虛擬多類別方式處理不平衡資料
A Virtual Multi-label Approach to Imbalanced Data Classification
Author: 楊善評 (Yang, Shan-Ping)
Advisor: 周珮婷 (Chou, Pei-Ting)
Keywords: 不平衡資料 (Imbalanced data); 不平衡分類問題 (Imbalanced classification problem); 虛擬多類別 (Virtual multi-label); Equal Kmeans
Date: 2020
Uploaded: 2-Sep-2020 11:41:50 (UTC+8)
Abstract (摘要): 大多數監督式學習方法對於不平衡資料的分類預測,在建構演算法的過程中,會以多數類別當作主要學習對象,因而犧牲少數類別,使分類器的性能下降。基於上述問題,本研究使用一個新的分類方法,結合Equal Kmeans的分群方式,以虛擬多類別來處理不平衡的問題,並且與常用的處理方式,包括抽樣方法中的過度抽樣、低額抽樣及SMOTE;分類器方法中的SVM及One-Class SVM進行比較。研究結果顯示本研究方法隨著資料不平衡程度的上升,會有越好的表現,且逐漸優於其他方法。
When classifying imbalanced data, most supervised learning methods treat the majority class as the primary learning target while building the algorithm, sacrificing the minority class and degrading classifier performance. To address this problem, this study proposes a new classification approach that combines Equal Kmeans clustering with virtual multi-label assignment to handle class imbalance. The proposed method is compared with commonly used techniques, including sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and One-Class SVM). The results show that the proposed method performs better as the degree of data imbalance increases, gradually outperforming the other methods.
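The virtual multi-label idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes "Equal Kmeans" means k-means constrained to equal-size clusters, and it substitutes a simple nearest-centroid rule for whatever multiclass classifier the thesis actually uses. All function names and the toy data are illustrative.

```python
# Rough sketch: split the majority class into k balanced "virtual" classes,
# each roughly the size of the minority class, then classify on k+1 labels
# and map any virtual class back to "majority" at prediction time.
import math
import random

def dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def equal_kmeans(points, k, iters=20, seed=0):
    """k-means variant that keeps cluster sizes (nearly) equal."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    cap = math.ceil(len(points) / k)          # per-cluster capacity
    for _ in range(iters):
        counts = [0] * k
        assign = {}
        # assign points greedily, closest-to-any-centroid first
        order = sorted(range(len(points)),
                       key=lambda i: min(dist(points[i], c) for c in centroids))
        for i in order:
            open_clusters = [j for j in range(k) if counts[j] < cap]
            best = min(open_clusters, key=lambda j: dist(points[i], centroids[j]))
            assign[i] = best
            counts[best] += 1
        # recompute centroids from current members
        for j in range(k):
            members = [points[i] for i in assign if assign[i] == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def fit_virtual_multilabel(majority, minority):
    """Choose k so each virtual majority class matches the minority size."""
    k = max(1, round(len(majority) / len(minority)))
    centroids = equal_kmeans(majority, k)
    minority_centroid = tuple(sum(c) / len(minority) for c in zip(*minority))
    return centroids + [minority_centroid], k

def predict(centroids, k, point):
    """Nearest-centroid prediction; virtual classes map back to 'majority'."""
    j = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
    return "minority" if j == k else "majority"

# Toy 4:1 imbalance: 40 majority points near (0, 0), 10 minority near (5, 5).
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(10)]
centroids, k = fit_virtual_multilabel(majority, minority)
print(predict(centroids, k, (5, 5)))   # expected: minority
print(predict(centroids, k, (0, 0)))   # expected: majority
```

With a 4:1 imbalance, k = 4, so the majority class is split into four virtual classes of about ten points each, the same size as the minority class; this is one plausible reading of how virtual labels restore balance without discarding (undersampling) or synthesizing (SMOTE) data.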
References: Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. Paper presented at the European Conference on Machine Learning.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Paper presented at the Proceedings of the 23rd international conference on Machine learning.
Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007). Learning on the border: active learning in imbalanced data classification. Paper presented at the Proceedings of the sixteenth ACM conference on Conference on information and knowledge management.
Ertekin, S., Huang, J., & Giles, C. L. (2007). Active learning for class imbalance problem. Paper presented at the Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1), 1-38.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861-874.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Paper presented at ICML.
Fushing, H., & Wang, X. (2020). Coarse- and fine-scale geometric information content of Multiclass Classification and implied Data-driven Intelligence. Proceedings, Machine Learning and Data Mining in Pattern Recognition, Petra Perner (Ed.), 16th International Conference on Machine Learning and Data Mining, MLDM 2020.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Paper presented at the International conference on intelligent computing.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine learning, 45(2), 171-186.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Paper presented at the 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9), 1263-1284.
Hong, X., Chen, S., & Harris, C. J. (2007). A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on neural networks, 18(1), 28-41.
Japkowicz, N. (2001). Supervised versus unsupervised binary-learning by feedforward neural networks. Machine learning, 42(1-2), 97-122.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent data analysis, 6(5), 429-449.
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
Kang, P., & Cho, S. (2006). EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems. Paper presented at the International Conference on Neural Information Processing.
Kukar, M., & Kononenko, I. (1998). Cost-sensitive learning with neural networks. Paper presented at the ECAI.
Lee, H.-j., & Cho, S. (2006). The novelty detection approach for different degrees of class imbalance. Paper presented at the International conference on neural information processing.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Liu, Y., An, A., & Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Maloof, M. A. (2003). Learning when data sets are imbalanced and when costs are unequal and unknown. Paper presented at the ICML-2003 workshop on learning from imbalanced data sets II.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. Paper presented at the Proceedings of workshop on learning from imbalanced datasets.
Ramey, J. (2016). datamicroarray: Collection of data sets for classification. URL: https://github.com/boost-R/datamicroarray.
Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter, 6(1), 60-69.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural computation, 13(7), 1443-1471.
Sun, Y., Kamel, M. S., & Wang, Y. (2006). Boosting for learning multiple classes with imbalanced class distribution. Paper presented at the Sixth International Conference on Data Mining (ICDM'06).
Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358-3378.
Tang, Y., & Zhang, Y.-Q. (2006). Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. Paper presented at the 2006 IEEE International Conference on Granular Computing.
Wang, B. X., & Japkowicz, N. (2008). Boosting support vector machines for imbalanced data sets. Paper presented at the International Symposium on Methodologies for Intelligent Systems.
Zou, K. H., O’Malley, A. J., & Mauri, L. (2007). Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation, 115(5), 654-657.
Description: Master's thesis, 國立政治大學 (National Chengchi University), 統計學系 (Department of Statistics), student ID 107354002
Source: http://thesis.lib.nccu.edu.tw/record/#G0107354002
Type: thesis
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/131471
Table of Contents:
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Sampling Methods
  2.2 Classifier Methods
  2.3 Classification Evaluation Metrics
  2.4 The Proposed Method
Chapter 3 Research Methods and Procedure
  3.1 Research Method
  3.2 Sampling Methods
  3.3 Classifier Methods
  3.4 Classification Evaluation Metrics
Chapter 4 Results and Analysis
  4.1 Data Preprocessing and Description
  4.2 Simulated Imbalanced Data Framework
  4.3 Experimental Procedure
  4.4 Results on the Simulated Imbalanced Data Framework
  4.5 Results on the Data Sets
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Format: PDF (2155143 bytes)
DOI: 10.6814/NCCU202001477