On Variable Selection of Classifying Proteomic Spectra Data (分類蛋白質質譜資料變數選取的探討)

Publications-Theses

Title	On Variable Selection of Classifying Proteomic Spectra Data (分類蛋白質質譜資料變數選取的探討)
Author	林婷婷
Contributors	郭訓志 (advisor); 林婷婷 (author)
Keywords	LARS; Forward Stagewise; LASSO; Group LASSO; Elastic Net; Support Vector Machine (SVM)
Date	2011
Upload time	30-Oct-2012 10:13:33 (UTC+8)
Abstract	This study uses prostate cancer proteomic mass spectrometry data provided by Eastern Virginia Medical School. The data are available both in raw form and in a preprocessed form; the empirical analysis here uses the preprocessed data. Data of this kind are typically high-dimensional, and high correlation among variables is common, so selecting the important features from a large pool in order to judge the degree of prostate pathology accurately is a common and important problem. The study examines how (penalized) regression models perform at variable selection when classifying proteomic spectra data. For LARS, Stagewise, LASSO, Group LASSO and Elastic Net, the order in which each model brings variables into the fit is taken as the variable ranking, and the classification results obtained from that ranking are compared with those obtained from "statistic ranking" (t-test, ANOVA F-test and Kruskal-Wallis test) and from SVM "misclassification-rate ranking". The results show that, across the six pairwise classification problems, the misclassification-rate trend of Group LASSO is more stable than that of the other methods, without large swings, and its minimum misclassification rate is better than that of the other methods in almost every case. In the four-class problem, Group LASSO also gives the lowest misclassification rate, which indicates that it is the most accurate of the methods compared when all four classes must be distinguished at once.
Description	Master's thesis; National Chengchi University (國立政治大學); Graduate Institute of Statistics (統計研究所); 98354021; 100
Source	http://thesis.lib.nccu.edu.tw/record/#G0098354021
Type	thesis
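
The ranking idea in the abstract, where the order in which a path-following regression model first brings each variable into the fit is used as that variable's rank, can be illustrated with Forward Stagewise, the simplest of the methods listed. The following is a minimal Python sketch under stated assumptions, not the thesis's code (the table of contents indicates the thesis used R functions). The function name `stagewise_entry_order`, the step size `eps`, the stopping tolerance `tol`, and the toy orthogonal design are all hypothetical choices made for illustration.

```python
def stagewise_entry_order(X, y, eps=0.01, tol=0.1, max_steps=10000):
    """Rank features by the step at which incremental forward stagewise
    regression first gives them a nonzero coefficient.

    X: list of rows, each a list of centered feature values.
    y: list of centered responses.
    Returns feature indices in order of first entry.
    """
    n, p = len(X), len(X[0])
    residual = list(y)
    beta = [0.0] * p
    order = []
    for _ in range(max_steps):
        # Inner product of each feature column with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j_best = max(range(p), key=lambda j: abs(corr[j]))
        if abs(corr[j_best]) < tol:
            break  # no feature is meaningfully correlated with the residual
        if j_best not in order:
            order.append(j_best)  # record the first time this feature enters
        step = eps if corr[j_best] > 0 else -eps
        beta[j_best] += step
        for i in range(n):
            residual[i] -= step * X[i][j_best]
    return order


# Toy data (hypothetical): four mutually orthogonal, centered columns.
cols = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
]
X = [[cols[j][i] for j in range(4)] for i in range(8)]
# The response depends strongly on feature 2 and weakly on feature 0.
y = [3 * row[2] + 1 * row[0] for row in X]

ranking = stagewise_entry_order(X, y)
print(ranking)  # [2, 0]
```

On this toy design the strongly related feature enters first, the weakly related one second, and the two unrelated features never enter. With real spectra the columns would need to be centered and scaled first, and the resulting ranking would then be fed to a classifier to compare misclassification rates, as the abstract describes.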

dc.contributor.advisor	郭訓志	zh_TW
dc.contributor.author (Authors)	林婷婷	zh_TW
dc.creator (Author)	林婷婷	zh_TW
dc.date (Date)	2011	en_US
dc.date.accessioned	30-Oct-2012 10:13:33 (UTC+8)	-
dc.date.available	30-Oct-2012 10:13:33 (UTC+8)	-
dc.date.issued (Upload time)	30-Oct-2012 10:13:33 (UTC+8)	-
dc.identifier (Other Identifiers)	G0098354021	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/54170	-
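dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Graduate Institute of Statistics (統計研究所)	zh_TW
dc.description (Description)	98354021	zh_TW
dc.description (Description)	100	zh_TW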
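dc.description.tableofcontents	Chapter 1 Introduction, p. 1; Section 1 Research Background, p. 1; Section 2 Research Motivation and Objectives, p. 2; Section 3 Research Framework, p. 3. Chapter 2 Introduction to Proteomic Mass Spectrometry Data, p. 4; Section 1 Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry, p. 4; Section 2 Prostate Cancer Proteomic Mass Spectrometry Data, p. 5. Chapter 3 Literature Review, p. 7. Chapter 4 Analysis Methods, p. 10; Section 1 Analysis Procedure, p. 10; Section 2 Statistic Ranking, p. 14; Section 3 LARS, Stagewise and LASSO Regression Models, p. 15; Section 4 Group LASSO Regression Model, p. 20; Section 5 Elastic Net Regression Model, p. 21; Section 6 Support Vector Machine (SVM), p. 24. Chapter 5 Empirical Analysis, p. 26; Section 1 Settings of the R Functions, p. 27; Section 2 Misclassification Rates for Pairwise Classification, p. 28; Section 3 Misclassification Rates for Four-Class Classification, p. 42. Chapter 6 Discussion of Results and Recommendations, p. 45. References, p. 47. Appendix 1, p. 50.	zh_TW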
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0098354021	en_US
dc.subject (Keywords)	LARS	zh_TW
dc.subject (Keywords)	Forward Stagewise	zh_TW
dc.subject (Keywords)	LASSO	zh_TW
dc.subject (Keywords)	Group LASSO	zh_TW
dc.subject (Keywords)	Elastic Net	zh_TW
dc.subject (Keywords)	支持向量機 (Support Vector Machine)	zh_TW
dc.subject (Keywords)	LARS	en_US
dc.subject (Keywords)	Forward Stagewise	en_US
dc.subject (Keywords)	LASSO	en_US
dc.subject (Keywords)	Group LASSO	en_US
dc.subject (Keywords)	Elastic Net	en_US
dc.subject (Keywords)	SVM	en_US
dc.title (Title)	分類蛋白質質譜資料變數選取的探討	zh_TW
dc.title (Title)	On Variable Selection of Classifying Proteomic Spectra Data	en_US
dc.type (Type)	thesis	en
