Title: 分類蛋白質質譜資料變數選取的探討 (On Variable Selection of Classifying Proteomic Spectra Data)
Author: 林婷婷
Contributors: 郭訓志 (advisor); 林婷婷
Keywords: LARS; Forward Stagewise; LASSO; Group LASSO; Elastic Net; support vector machine (SVM)
Date: 2011
Upload time: 30-Oct-2012 10:13:33 (UTC+8)
Abstract:
This study uses prostate cancer proteomic mass spectra data provided by Eastern Virginia Medical School. The data are available both in raw form and in a preprocessed form, and the empirical analysis here uses the preprocessed data. Because such data are typically high-dimensional, strong correlation among variables is common, so selecting the important features from a large number of candidates in order to classify the degree of prostate pathology accurately has become a common and important problem. The purpose of this study is to examine the variable selection results of several (penalized) regression models for classifying proteomic spectra data. For LARS, Stagewise, LASSO, Group LASSO and Elastic Net, the order in which each model selects the variables is used as a variable ranking, and the classification results obtained from that ranking are compared with those obtained from "statistic ranking" (t-test, ANOVA F-test and Kruskal-Wallis test) and from SVM "misclassification rate ranking". The results show that, across the six pairwise classifications, the misclassification rates of Group LASSO follow a more stable trend than those of the other methods, without sharp rises or falls, and its minimum misclassification rate is better than that of the other methods in almost all cases. In the four-class problem, Group LASSO also attains the lowest misclassification rate among the methods compared, which indicates that it is the most accurate of these methods when all four classes must be distinguished simultaneously.
Description: Master's degree (碩士)
Description: National Chengchi University (國立政治大學)
Description: Graduate Institute of Statistics (統計研究所)
Description: 98354021
Description: 100
Source: http://thesis.lib.nccu.edu.tw/record/#G0098354021
Type: thesis
Identifier: G0098354021
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/54170
Language: en_US
Table of contents:
Chapter 1. Introduction
  Section 1. Research Background
  Section 2. Research Motivation and Objectives
  Section 3. Structure of the Study
Chapter 2. Introduction to Proteomic Mass Spectra Data
  Section 1. Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF MS)
  Section 2. Prostate Cancer Proteomic Mass Spectra Data
Chapter 3. Literature Review
Chapter 4. Methods of Analysis
  Section 1. Analysis Procedure
  Section 2. Statistic Ranking
  Section 3. LARS, Stagewise and LASSO Regression Models
  Section 4. Group LASSO Regression Model
  Section 5. Elastic Net Regression Model
  Section 6. Support Vector Machine (SVM)
Chapter 5. Empirical Analysis
  Section 1. Settings of the R Functions
  Section 2. Misclassification Rates for Pairwise Classification
  Section 3. Misclassification Rates for Four-Class Classification
Chapter 6. Discussion of Results and Suggestions
References
Appendix 1
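Illustrative sketch (not part of the original record): the abstract describes ranking variables by the order in which a stagewise or penalized regression selects them, and comparing that ranking with a statistic-based one. The snippet below is a minimal pure-Python illustration of the idea under several assumptions. It is not the thesis's R code and does not use the Eastern Virginia Medical School data. It uses incremental forward stagewise regression with a fixed small step and a Welch two-sample t statistic. The function names `standardize`, `stagewise_entry_order` and `welch_t_ranking`, the 0/1 coding of the class label, and the toy values are all hypothetical choices made for this example.

```python
import math
import statistics


def standardize(column):
    """Center a column and scale it to unit Euclidean norm."""
    mean = statistics.fmean(column)
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def stagewise_entry_order(columns, y, step=0.01, max_iter=2000):
    """Return feature indices in the order forward stagewise first picks them."""
    cols = [standardize(c) for c in columns]
    y_mean = statistics.fmean(y)
    residual = [v - y_mean for v in y]
    order = []
    for _ in range(max_iter):
        # Inner product of every standardized feature with the current residual.
        corr = [sum(a * r for a, r in zip(col, residual)) for col in cols]
        j = max(range(len(cols)), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-8:
            break  # residual is numerically uncorrelated with all features
        if j not in order:
            order.append(j)
        # Move the chosen coefficient by a small fixed step, then update the residual.
        delta = math.copysign(step, corr[j])
        residual = [r - delta * a for r, a in zip(residual, cols[j])]
    return order


def welch_t_ranking(columns, labels):
    """Return feature indices sorted by decreasing |Welch t| between classes 0 and 1."""
    scores = []
    for col in columns:
        a = [v for v, g in zip(col, labels) if g == 0]
        b = [v for v, g in zip(col, labels) if g == 1]
        se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)) / se)
    return sorted(range(len(columns)), key=lambda k: -scores[k])


# Made-up toy data: 8 samples, 2 classes, 3 features (not the thesis data).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0.1, 0.1, -0.1, -0.1, 2.1, 2.1, 1.9, 1.9],    # strongly separates the classes
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # uncorrelated with the class label
    [0.0, 1.0, -1.0, 0.5, 1.0, 0.2, 1.5, -0.2],    # weakly related to the class label
]

print("stagewise entry order:", stagewise_entry_order(features, labels))
print("Welch t ranking:", welch_t_ranking(features, labels))
```

On this toy input the strongly separating feature is selected first by the stagewise pass and also ranks first by |t|. In a study like the one described, each ranking would presumably then be passed to a classifier to see how the misclassification rate changes as variables are added in that order. That step is omitted here.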