Title: 以逐步SVM縮減p大n小資料型態之維度 (Dimension reduction of large p small n data set based on stepwise SVM)
Author: Ko, Tzu Wei (柯子惟)
Advisor: 周珮婷
Keywords: dimension reduction (維度縮減); feature selection (特徵選取); large p small n data set (p大n小資料型態); stepwise SVM (逐步SVM)
Date: 2017
Uploaded: 3-Jul-2017 14:35:01 (UTC+8)

Abstract: This study addresses dimension reduction for large p small n data and proposes a stepwise SVM method. It is compared against the unreduced data and against three dimension reduction methods: principal component analysis (PCA), Pearson product-moment correlation coefficients (PCCs), and random-forest-based recursive feature elimination (RF-RFE), to examine whether stepwise SVM can select feature sets that better discriminate the sample classes. The data are six disease-related gene expression and biological spectroscopy data sets. First, stepwise SVM was applied for feature selection in a supervised setting; the selection results show that stepwise SVM effectively extracts, from all variables, the features most important for classifying the samples. Each data set was then split into training and test sets; dimension reduction was performed in a semi-supervised setting with stepwise SVM, PCA, PCCs, and RF-RFE; and an SVM model was fitted to compute prediction accuracy. This procedure was repeated 100 times, and the average was taken as each method's final prediction accuracy. The results show that the prediction accuracy obtained with stepwise SVM was consistently higher than that of the unprocessed original data. Compared with the other methods, stepwise SVM was more stable than PCA and RF-RFE, while little difference was observed relative to PCCs. The study concludes that dimension reduction is necessary for large p small n data, because it effectively removes noise and improves the model's overall prediction accuracy.

References:
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
Bellman, R. E. (2015). Adaptive Control Processes: A Guided Tour. Princeton University Press.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, USA.
Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1), 1.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. doi:10.1023/a:1010933404324
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. doi:10.1007/bf00994018
Cunningham, P. (2008). Dimension reduction. In M. Cord & P. Cunningham (Eds.), Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval (pp. 91-112). Berlin, Heidelberg: Springer.
Dai, J. J., Lieu, L., & Rocke, D. (2006). Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology, 5(1), 1147.
Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., . . . Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963-4967.
Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83(2), 83-90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.
Guyon, I., Gunn, S. R., Ben-Hur, A., & Dror, G. (2004). Result analysis of the NIPS 2003 feature selection challenge. Paper presented at NIPS.
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., . . . Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344(8), 539-548. doi:10.1056/nejm200102223440801
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., . . . Peterson, C. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673-679.
Lee Rodgers, J., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59-66. doi:10.1080/00031305.1988.10475524
Pal, M., & Foody, G. M. (2010). Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2307.
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, 2(11), 559-572. doi:10.1080/14786440109462720
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., . . . Pinkus, G. S. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68-74.
Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. In Data Classification: Algorithms and Applications, 37.
Tin Kam, H. (1995). Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition.
Xu, X., & Wang, X. (2005). An adaptive network intrusion detection method based on PCA and support vector machines. In X. Li, S. Wang, & Z. Y. Dong (Eds.), Advanced Data Mining and Applications: First International Conference, ADMA 2005 (pp. 696-703). Berlin, Heidelberg: Springer.
Yeung, K. Y., & Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17(9), 763-774. doi:10.1093/bioinformatics/17.9.763
林宗勳. Support Vector Machine 簡介 [An introduction to Support Vector Machines].

Degree: Master's thesis
Institution: National Chengchi University (國立政治大學)
Department: Department of Statistics (統計學系)
Student ID: 104354021
Identifier: G0104354021
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/110650
Source: http://thesis.lib.nccu.edu.tw/record/#G0104354021
Type: thesis
Format: application/pdf (191267524 bytes)

Table of Contents:
Chapter 1 Research Motivation and Objectives
  1.1 Current state of dimension reduction for high-dimensional, small-sample data
  1.2 Research motivation and objectives
Chapter 2 Literature Review
Chapter 3 Research Methods and Data
  3.1 Stepwise SVM
  3.2 Algorithms used
  3.3 Dimension reduction methods used
  3.4 Description of the data
Chapter 4 Data Analysis and Results
  4.1 Experimental procedure and analysis
  4.2 Results and method comparison
Chapter 5 Conclusions and Suggestions
  5.1 Conclusions
  5.2 Limitations and suggestions
References
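The record names stepwise SVM but does not describe the procedure itself. A minimal sketch of one plausible reading is given below: a greedy forward search that adds whichever feature most improves cross-validated SVM accuracy and stops when no candidate improves the score. The function name `stepwise_svm_select`, the `max_features` cap, and the stopping rule are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical forward stepwise feature selection with a linear SVM.
# Assumption: "stepwise SVM" greedily adds the feature that most improves
# cross-validated accuracy, stopping when no candidate helps.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def stepwise_svm_select(X, y, max_features=10, cv=3):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current selection.
        gains = []
        for j in remaining:
            cols = selected + [j]
            score = cross_val_score(
                SVC(kernel="linear"), X[:, cols], y, cv=cv).mean()
            gains.append((score, j))
        score, j = max(gains)
        if score <= best_score:  # no feature improves accuracy: stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```

Greedy forward selection is one of several stepwise variants (backward elimination and bidirectional search are others); the record alone does not say which the thesis uses.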
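The evaluation protocol described in the abstract (split into training and test sets, reduce dimension, fit an SVM, and average test accuracy over repetitions) can be sketched as follows, using PCA as one of the compared reducers. The abstract calls the reduction step semi-supervised without giving details, so this sketch simply fits the reducer on the training split; `repeated_accuracy` and its parameters are illustrative, and fewer repetitions are used than the thesis's 100.

```python
# Sketch of the repeated split/reduce/fit/score protocol from the abstract,
# on synthetic large p small n data (more features than samples).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def repeated_accuracy(X, y, n_components=5, repeats=10, seed=0):
    scores = []
    for r in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed + r, stratify=y)
        # PCA is fitted on the training split only, then the reduced
        # representation feeds a linear SVM.
        model = make_pipeline(PCA(n_components=n_components),
                              SVC(kernel="linear"))
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores))
```

Averaging over repeated random splits, as the thesis does, reduces the variance that a single train/test split would introduce when n is small.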