Title: 以逐步SVM縮減p大n小資料型態之維度
Title (English): Dimension reduction of large p small n data set based on stepwise SVM
Author: 柯子惟 (Ko, Tzu Wei)
Advisor: 周珮婷
Keywords: 維度縮減; 特徵選取; p大n小資料型態; 逐步SVM; Stepwise SVM; Dimension reduction; Feature selection; Large p small n data set
Date: 2017
Uploaded: 3 July 2017 14:35:01 (UTC+8)

Abstract: This study addresses dimension reduction for large p small n data. It proposes a stepwise SVM method and compares it with the unreduced data (no variables removed) and with three dimension reduction methods: principal component analysis (PCA), Pearson product-moment correlation coefficients (PCCs), and random-forest-based recursive feature elimination (RF-RFE). It also examines whether stepwise SVM can select a feature set that better discriminates between sample classes. The data are six disease-related gene expression and biological spectral data sets. First, stepwise SVM is used for feature selection under supervised learning; the selection results show that stepwise SVM does extract, from all variables, the features that matter most for classifying the samples. Next, each data set is split into training and test sets, and stepwise SVM, PCA, PCCs, and RF-RFE are applied under semi-supervised learning to reduce its dimension. An SVM model is then fitted and its prediction rate computed. This procedure is repeated 100 times, and the average is taken as the final prediction accuracy of each dimension reduction method. The prediction accuracy obtained with stepwise SVM is better than that of the unprocessed original data in every case. Compared with the other methods, stepwise SVM is more stable than PCA and RF-RFE, while the difference from PCCs is hard to discern. The study concludes that dimension reduction is necessary for large p small n data, because it removes noise from the data and thereby improves the model's overall prediction accuracy.

Description: Master's thesis; 國立政治大學 (National Chengchi University); Department of Statistics; 104354021
Identifier: G0104354021
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/110650
Source: http://thesis.lib.nccu.edu.tw/record/#G0104354021
Type: thesis
Format: application/pdf, 191267524 bytes

Table of contents:
Chapter 1 Research Motivation and Purpose
  Section 1 Current State of Dimension Reduction for High-Dimensional, Small-Sample Data
  Section 2 Research Motivation and Purpose
Chapter 2 Literature Review
Chapter 3 Research Methods and Data
  Section 1 Stepwise SVM
  Section 2 Algorithms Used
  Section 3 Dimension Reduction Methods Used
  Section 4 Description of the Data
Chapter 4 Data Analysis and Results
  Section 1 Experimental Procedure and Analysis
  Section 2 Results and Comparison of Methods
Chapter 5 Conclusions and Suggestions
  Section 1 Conclusions
  Section 2 Limitations and Suggestions
References
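Illustrative sketch: the abstract describes an evaluation protocol (split the data into training and test sets, reduce the dimension, fit an SVM, record the test accuracy, repeat 100 times, and average). The Python below is a minimal sketch of that protocol only, under several assumptions that are not from the thesis: the data are synthetic, the reducer is a simple Pearson-correlation filter standing in for the PCCs method, the classifier is a Pegasos-style linear SVM without an intercept, and all function names and parameters are hypothetical. This record does not describe the stepwise SVM algorithm itself, so it is not reproduced here.

```python
import random

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def top_k_by_correlation(X, y, k):
    """Indices of the k features with the largest |correlation| with the label."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -scores[j])[:k]

def train_linear_svm(X, y, lam=0.01, epochs=20, rng=None):
    """Pegasos-style subgradient descent for a linear SVM; labels are +1/-1.
    Assumption: no intercept, which is adequate for the symmetric toy data."""
    rng = rng or random.Random(0)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularizer)
            if margin < 1:                             # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def accuracy(w, X, y):
    """Fraction of rows whose predicted sign matches the label."""
    hits = 0
    for row, label in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, row))
        hits += (1 if score >= 0 else -1) == label
    return hits / len(y)

def make_toy_data(n, p, informative, shift, rng):
    """Synthetic large-p-small-n data: only the first `informative`
    features carry class signal; the rest are pure noise."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        X.append([rng.gauss(label * shift if j < informative else 0.0, 1.0)
                  for j in range(p)])
        y.append(label)
    return X, y

def repeated_holdout(X, y, k=None, repeats=100, test_fraction=0.3, seed=0):
    """Mean test accuracy over random splits; k=None keeps every feature.
    The feature filter is fitted on the training split only."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, int(n * test_fraction))
    accs = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        Xte, yte = [X[i] for i in test], [y[i] for i in test]
        cols = list(range(len(X[0]))) if k is None else top_k_by_correlation(Xtr, ytr, k)
        w = train_linear_svm([[r[j] for j in cols] for r in Xtr], ytr, rng=rng)
        accs.append(accuracy(w, [[r[j] for j in cols] for r in Xte], yte))
    return sum(accs) / len(accs)

if __name__ == "__main__":
    X, y = make_toy_data(n=40, p=200, informative=5, shift=1.0, rng=random.Random(42))
    # Fewer repeats than the 100 in the abstract, to keep the demo quick.
    print("all features  :", repeated_holdout(X, y, k=None, repeats=20))
    print("top 5 by |r|  :", repeated_holdout(X, y, k=5, repeats=20))
```

Selecting features inside each split, from the training rows only, keeps the test rows from influencing the reducer. Whether the thesis's semi-supervised setup handles the test rows this way is not stated in this record.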