政大圖書館 (NCCU Library)
Title: 以虛擬多類別方式處理不平衡資料
A Virtual Multi-label Approach to Imbalanced Data Classification
Author: 楊善評 (Yang, Shan-Ping)
Advisor: 周珮婷 (Chou, Pei-Ting)
Keywords: 不平衡資料 (Imbalanced data); 不平衡分類問題 (Imbalanced classification problem); 虛擬多類別 (Virtual multi-label); Equal Kmeans
Date: 2020
Uploaded: 2-Sep-2020 11:41:50 (UTC+8)
Abstract (摘要): 大多數監督式學習方法對於不平衡資料的分類預測,在建構演算法的過程中,會以多數類別當作主要學習對象,因而犧牲少數類別,使分類器的性能下降。基於上述問題,本研究使用一個新的分類方法,結合Equal Kmeans的分群方式,以虛擬多類別來處理不平衡的問題,並且與常用的處理方式,包括抽樣方法中的過度抽樣、低額抽樣及SMOTE;分類器方法中的SVM及One-Class SVM進行比較。研究結果顯示本研究方法隨著資料不平衡程度的上升,會有越好的表現,且逐漸優於其他方法。
When classifying imbalanced data, most supervised learning methods treat the majority class as the primary learning target while building the algorithm, sacrificing the minority class and degrading classifier performance. To address this problem, this study proposes a new classification approach that combines Equal Kmeans clustering with virtual multi-label assignment to handle class imbalance. The proposed method is compared with commonly used techniques, including sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and One-Class SVM). The results show that the proposed method performs better as the degree of data imbalance increases, gradually outperforming the other methods.
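The virtual multi-label idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes "Equal Kmeans" means k-means constrained to equal-size clusters, and it substitutes a simple nearest-centroid rule for whatever multiclass classifier the thesis actually uses. All function names and the toy data are illustrative.

```python
# Rough sketch: split the majority class into k balanced "virtual" classes,
# each roughly the size of the minority class, then classify on k+1 labels
# and map any virtual class back to "majority" at prediction time.
import math
import random

def dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def equal_kmeans(points, k, iters=20, seed=0):
    """k-means variant that keeps cluster sizes (nearly) equal."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    cap = math.ceil(len(points) / k)          # per-cluster capacity
    for _ in range(iters):
        counts = [0] * k
        assign = {}
        # assign points greedily, closest-to-any-centroid first
        order = sorted(range(len(points)),
                       key=lambda i: min(dist(points[i], c) for c in centroids))
        for i in order:
            open_clusters = [j for j in range(k) if counts[j] < cap]
            best = min(open_clusters, key=lambda j: dist(points[i], centroids[j]))
            assign[i] = best
            counts[best] += 1
        # recompute centroids from current members
        for j in range(k):
            members = [points[i] for i in assign if assign[i] == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def fit_virtual_multilabel(majority, minority):
    """Choose k so each virtual majority class matches the minority size."""
    k = max(1, round(len(majority) / len(minority)))
    centroids = equal_kmeans(majority, k)
    minority_centroid = tuple(sum(c) / len(minority) for c in zip(*minority))
    return centroids + [minority_centroid], k

def predict(centroids, k, point):
    """Nearest-centroid prediction; virtual classes map back to 'majority'."""
    j = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
    return "minority" if j == k else "majority"

# Toy 4:1 imbalance: 40 majority points near (0, 0), 10 minority near (5, 5).
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(10)]
centroids, k = fit_virtual_multilabel(majority, minority)
print(predict(centroids, k, (5, 5)))   # expected: minority
print(predict(centroids, k, (0, 0)))   # expected: majority
```

With a 4:1 imbalance, k = 4, so the majority class is split into four virtual classes of about ten points each, the same size as the minority class; this is one plausible reading of how virtual labels restore balance without discarding (undersampling) or synthesizing (SMOTE) data.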
References: Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. Paper presented at the European Conference on Machine Learning.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Paper presented at the Proceedings of the 23rd international conference on Machine learning.
Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007). Learning on the border: active learning in imbalanced data classification. Paper presented at the Proceedings of the sixteenth ACM conference on Conference on information and knowledge management.
Ertekin, S., Huang, J., & Giles, C. L. (2007). Active learning for class imbalance problem. Paper presented at the Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1), 1-38.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861-874.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Paper presented at ICML.
Fushing, H., & Wang, X. (2020). Coarse- and fine-scale geometric information content of Multiclass Classification and implied Data-driven Intelligence. Proceedings, Machine Learning and Data Mining in Pattern Recognition, Petra Perner (Ed.), 16th International Conference on Machine Learning and Data Mining, MLDM 2020.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Paper presented at the International conference on intelligent computing.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine learning, 45(2), 171-186.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Paper presented at the 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9), 1263-1284.
Hong, X., Chen, S., & Harris, C. J. (2007). A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on neural networks, 18(1), 28-41.
Japkowicz, N. (2001). Supervised versus unsupervised binary-learning by feedforward neural networks. Machine learning, 42(1-2), 97-122.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent data analysis, 6(5), 429-449.
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
Kang, P., & Cho, S. (2006). EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems. Paper presented at the International Conference on Neural Information Processing.
Kukar, M., & Kononenko, I. (1998). Cost-sensitive learning with neural networks. Paper presented at the ECAI.
Lee, H.-j., & Cho, S. (2006). The novelty detection approach for different degrees of class imbalance. Paper presented at the International conference on neural information processing.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Liu, Y., An, A., & Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Maloof, M. A. (2003). Learning when data sets are imbalanced and when costs are unequal and unknown. Paper presented at the ICML-2003 workshop on learning from imbalanced data sets II.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. Paper presented at the Proceedings of workshop on learning from imbalanced datasets.
Ramey, J. (2016). datamicroarray: Collection of data sets for classification. URL: https://github.com/boost-R/datamicroarray.
Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter, 6(1), 60-69.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural computation, 13(7), 1443-1471.
Sun, Y., Kamel, M. S., & Wang, Y. (2006). Boosting for learning multiple classes with imbalanced class distribution. Paper presented at the Sixth International Conference on Data Mining (ICDM'06).
Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358-3378.
Tang, Y., & Zhang, Y.-Q. (2006). Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. Paper presented at the 2006 IEEE International Conference on Granular Computing.
Wang, B. X., & Japkowicz, N. (2008). Boosting support vector machines for imbalanced data sets. Paper presented at the International Symposium on Methodologies for Intelligent Systems.
Zou, K. H., O’Malley, A. J., & Mauri, L. (2007). Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation, 115(5), 654-657.
Description: Master's thesis, 國立政治大學 (National Chengchi University), 統計學系 (Department of Statistics), student ID 107354002
Source: http://thesis.lib.nccu.edu.tw/record/#G0107354002
Type: thesis
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/131471
Table of Contents:
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Sampling Methods
  2.2 Classifier Methods
  2.3 Classification Evaluation Metrics
  2.4 The Proposed Method
Chapter 3 Research Methods and Procedure
  3.1 Research Method
  3.2 Sampling Methods
  3.3 Classifier Methods
  3.4 Classification Evaluation Metrics
Chapter 4 Results and Analysis
  4.1 Data Preprocessing and Description
  4.2 Simulated Imbalanced Data Framework
  4.3 Experimental Procedure
  4.4 Results on the Simulated Imbalanced Data Framework
  4.5 Results on the Data Sets
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Format: PDF (2155143 bytes)
DOI: 10.6814/NCCU202001477