Title 分類蛋白質質譜資料變數選取的探討
On Variable Selection of Classifying Proteomic Spectra Data
Author 林婷婷
Contributors 郭訓志
林婷婷
Keywords LARS
Forward Stagewise
LASSO
Group LASSO
Elastic Net
Support vector machine (SVM)
Date 2011
Upload time 30-Oct-2012 10:13:33 (UTC+8)
Abstract This study uses prostate cancer proteomic mass spectrometry data provided by Eastern Virginia Medical School. The data are available both in raw form and in a preprocessed form; the empirical analysis here uses the preprocessed data. Such data are typically high-dimensional, and high correlation among variables is common, so selecting the important features from a large number of candidates in order to accurately classify the degree of prostate pathology is a common and important problem. The purpose of this study is to examine the variable selection results of several (penalized) regression models for classifying proteomic spectra data. For LARS, Stagewise, LASSO, Group LASSO, and Elastic Net, the order in which each model selects variables is taken as a variable ranking, and the classification results obtained from that ranking are compared with those obtained from ranking by test statistic (t-test, ANOVA F-test, and Kruskal-Wallis test) and from SVM ranking by misclassification rate. The results show that, across the six pairwise classification problems, the misclassification rate of Group LASSO follows a more stable trend than that of the other methods, without large swings, and its minimum misclassification rate is better than that of the other methods in almost all cases. In the four-class problem, Group LASSO also attains the lowest misclassification rate among the methods compared, which indicates that it gives the best classification accuracy when all four classes must be distinguished simultaneously.
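The abstract describes ranking variables by the order in which a (penalized) regression procedure selects them. As a rough illustration only, the sketch below runs incremental forward stagewise regression on a small hand-made data set and records the order in which variables first enter the model. The table of contents indicates that the thesis itself used R functions; this Python version, its function names, the step size, and the toy data are all assumptions made for illustration and are not taken from the thesis.

```python
# Illustrative sketch only: incremental forward stagewise regression.
# Function names, step size, and the toy data are assumptions, not the thesis code.

def standardize(column):
    """Center a column and scale it to unit Euclidean norm (assumes it is not constant)."""
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]
    norm = sum(v * v for v in centered) ** 0.5
    return [v / norm for v in centered]


def stagewise_entry_order(columns, y, step=0.01, n_steps=2000):
    """Return variable indices in the order they first receive a nonzero coefficient."""
    xs = [standardize(c) for c in columns]
    mean_y = sum(y) / len(y)
    residual = [v - mean_y for v in y]
    order = []
    for _ in range(n_steps):
        # Inner product of each standardized predictor with the current residual.
        corr = [sum(a * r for a, r in zip(x, residual)) for x in xs]
        j = max(range(len(xs)), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-8:
            break  # residual is numerically orthogonal to every predictor
        if j not in order:
            order.append(j)
        # Move the chosen coefficient a small step in the direction of its correlation.
        delta = step if corr[j] > 0 else -step
        residual = [r - delta * a for r, a in zip(residual, xs[j])]
    return order


# Toy data: three mutually orthogonal predictors; y depends on the first two only.
columns = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
y = [4 * a + 2 * b for a, b in zip(columns[0], columns[1])]

print(stagewise_entry_order(columns, y))  # prints [0, 1]
```

In this toy case the variable with the larger effect enters first and the irrelevant third variable never enters. An entry order of this kind is what the abstract treats as a variable ranking; real proteomic spectra are high-dimensional and correlated, so the ordering there is far less clear-cut.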
Description Master's
National Chengchi University
Graduate Institute of Statistics
98354021
100
Source http://thesis.lib.nccu.edu.tw/record/#G0098354021
Type thesis
dc.contributor.advisor 郭訓志
dc.identifier (Other Identifiers) G0098354021
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/54170
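dc.description.tableofcontents Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Research Framework
Chapter 2 Introduction to Proteomic Mass Spectrometry Data
Section 1 Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF MS)
Section 2 Prostate Cancer Proteomic Mass Spectrometry Data
Chapter 3 Literature Review
Chapter 4 Analysis Methods
Section 1 Analysis Procedure
Section 2 Ranking by Test Statistic
Section 3 LARS, Stagewise, and LASSO Regression Models
Section 4 Group LASSO Regression Model
Section 5 Elastic Net Regression Model
Section 6 Support Vector Machine (SVM)
Chapter 5 Empirical Analysis
Section 1 R Function Settings
Section 2 Misclassification Rates for Pairwise Classification
Section 3 Misclassification Rates for Four-Class Classification
Chapter 6 Discussion of Results and Recommendations
References
Appendix 1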
dc.language.iso en_US