
Title Data-driven Hybrid Approach for Imbalanced Data in Supervised Learning (不平衡資料之數據驅動混合監督式學習方法)
Author Liu, Te-Hsin (劉得心)
Advisor Chou, Pei-Ting (周珮婷)
Keywords Imbalanced data; Supervised learning; PLR; Binary classification
Date 2022
Uploaded 1-Jul-2022 16:58:02 (UTC+8)
Abstract In imbalanced data, certain classes contain far fewer samples than the others, so the class proportions are highly skewed. This makes it hard for a supervised classification model to learn the features of the minority class during training, which leads to prediction errors. To address this problem, this study makes two different adjustments to the supervised learning method Pseudo-Likelihood Ratio (PLR) and proposes a classification model for each adjustment. To examine how the two adjusted models perform under different degrees of imbalance, they were compared against three baselines (the original PLR, KNN, and SVM) on five datasets resampled at various imbalance ratios. The results show that the proposed improvements to PLR perform differently across datasets but, overall, are effective in raising the classification performance of the original PLR.
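The evaluation protocol the abstract describes (fix a classifier, sweep the class-imbalance ratio, and score performance on the minority class) can be sketched in a few lines. This is an illustrative stdlib-only mock-up: the synthetic Gaussian data, the `make_imbalanced` helper, and the choice of k = 5 are assumptions for demonstration only, and the thesis's PLR models are not reproduced here; a plain KNN baseline, one of the comparison models in the study, stands in.

```python
import random
from collections import Counter

def make_imbalanced(n, minority_ratio, seed=0):
    """Simulate a toy 2-feature binary dataset where class 1 is the minority.

    The Gaussian cluster centres and spreads are assumptions made for
    illustration; they are not the datasets used in the thesis.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < minority_ratio:
            data.append(((rng.gauss(2, 1), rng.gauss(2, 1)), 1))  # minority class
        else:
            data.append(((rng.gauss(0, 1), rng.gauss(0, 1)), 0))  # majority class
    return data

def knn_predict(train, point, k=5):
    """Plain k-nearest-neighbour majority vote (one of the baselines compared)."""
    nearest = sorted(
        train,
        key=lambda t: (t[0][0] - point[0]) ** 2 + (t[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def minority_recall(train, test, k=5):
    """Recall on the minority class: the kind of signal overall accuracy hides."""
    total = hits = 0
    for point, label in test:
        if label == 1:
            total += 1
            hits += int(knn_predict(train, point, k) == 1)
    return hits / total if total else 0.0

# Re-run the same classifier while the imbalance grows more extreme.
for ratio in (0.5, 0.2, 0.05):
    train = make_imbalanced(400, ratio, seed=1)
    test = make_imbalanced(200, ratio, seed=2)
    print(f"minority ratio {ratio:.2f} -> minority recall {minority_recall(train, test):.2f}")
```

As the minority ratio shrinks, minority-class recall typically degrades; quantifying that effect across five models and several ratios is what the comparison in the thesis does.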
References

Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. Paper presented at the European Conference on Machine Learning.

Brown, J. B. (2018). Classifiers and their metrics quantified. Molecular Informatics, 37(1-2), 1700127.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Chawla, N. V., Japkowicz, N., & Kolcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1-6.

Chou, E. P., & Yang, S.-P. (2022). A virtual multi-label approach to imbalanced data classification. Communications in Statistics - Simulation and Computation. https://doi.org/10.1080/03610918.2022.2049820

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.

Fan, W., Stolfo, S. J., Zhang, J., & Chan, P. K. (1999). AdaCost: Misclassification cost-sensitive boosting. Proceedings of the International Conference on Machine Learning, 97-105.

He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.

Hong, X., Chen, S., & Harris, C. J. (2007). A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on Neural Networks, 18(1), 28-41.

Hsieh, F., & Chou, E. P. (2020). Categorical exploratory data analysis: From multiclass classification and response manifold analytics perspectives of baseball pitching dynamics. Entropy, 23(7), 792.

Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429-449.

Kang, P., & Cho, S. (2006). EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems. Paper presented at the International Conference on Neural Information Processing.

Lee, H.-J., & Cho, S. (2006). The novelty detection approach for different degrees of class imbalance. Paper presented at the International Conference on Neural Information Processing.

Liu, Y., An, A., & Huang, X. (2006). Boosting prediction accuracy on imbalanced datasets with SVM ensembles. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: A case study. ACM SIGKDD Explorations Newsletter, 6(1), 60-69.

Seliya, N., Khoshgoftaar, T. M., & Van Hulse, J. (2009). A study on the relationships of classifier performance metrics. IEEE International Conference on Tools with Artificial Intelligence, 59-66.

Tang, Y., & Zhang, Y.-Q. (2006). Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. Paper presented at the 2006 IEEE International Conference on Granular Computing.

Wang, B. X., & Japkowicz, N. (2008). Boosting support vector machines for imbalanced data sets. Paper presented at the International Symposium on Methodologies for Intelligent Systems.

Zou, K. H., O'Malley, A. J., & Mauri, L. (2007). Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation, 115(5), 654-657.
Description Master's thesis
National Chengchi University
Department of Statistics
109354020
Source http://thesis.lib.nccu.edu.tw/record/#G0109354020
Type thesis
dc.identifier (Other Identifiers) G0109354020
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/140753
dc.description.tableofcontents Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
 1.1 Research Background
 1.2 Research Objectives
Chapter 2 Literature Review
 2.1 Classification Models
 2.2 Evaluation Metrics
 2.3 Research Methods
Chapter 3 Research Methods and Procedure
 3.1 Classification Models
 3.2 Pseudo-Likelihood Ratio
 3.3 Research Methods
 3.4 Evaluation Metrics
Chapter 4 Results and Analysis
 4.1 Data Description
 4.2 Imbalanced-Data Simulation
 4.3 Experimental Procedure
 4.4 Experimental Results and Analysis
Chapter 5 Conclusions and Suggestions
 5.1 Conclusions
 5.2 Suggestions
References
Appendix 1 Imbalanced-data simulation results
Appendix 2 Ionosphere dataset results
Appendix 3 Abalone dataset results
Appendix 4 Epileptic Seizure Recognition dataset results
Appendix 5 Waveform Database Generator dataset results
Appendix 6 Credit Card dataset results
dc.format.extent 1280001 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202200484