An AUC criterion for feature selection on classifying proteomic spectra data

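Title	An AUC criterion for feature selection on classifying proteomic spectra data (original title: 使用AUC特徵選取方法在蛋白質質譜儀資料分類之應用)
Author	葉勝宗
Contributors	張源俊 (advisor); 郭訓志 (advisor); 葉勝宗 (author)
Keywords	surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS); feature selection; classification; area under the ROC curve (AUC); support vector machine (SVM); segmentation; SELDI
Date	2005
Upload date	2009-09-14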
Abstract	Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) is a technique for measuring the expression of proteins by molecular mass, and each spectrum it produces is very high-dimensional. Because of limitations of the SELDI technology, the scanned spectra often contain errors and noise, so low-level preprocessing is usually applied to the raw data before analysis. The preprocessing steps include baseline subtraction, normalization, peak detection, and peak alignment. This thesis studies a prostate cancer data set whose samples fall into four classes: healthy men, benign prostate hyperplasia, early-stage prostate cancer, and late-stage prostate cancer. We analyze and compare two preprocessed versions of these spectra: one preprocessed by ourselves and one preprocessed by Adam et al. To address the location shifts and isotope problems that often arise when SELDI measures molecular mass, we propose using segments of the mass-to-charge (m/z) axis as new features. Important segments are selected by the area under the ROC curve (AUC), and a support vector machine (SVM) is used for classification. For four-class classification, our own preprocessed data gave 89% accuracy on training samples and 63% on test (validation) samples, while the data preprocessed by Adam et al. gave 94% on training samples and 86% on test samples. These results indicate that the preprocessing method does affect the classification result, and that a reasonable classification result can be obtained by using segments of features.
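The segment-and-rank idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the fixed segment width, the choice of the maximum intensity as the segment summary, and the function names `auc`, `segment_features`, and `rank_segments` are all assumptions made for illustration. The sketch handles one pair of classes, since the AUC is a two-class quantity, and leaves out the SVM classification step.

```python
from typing import Dict, List, Sequence, Tuple


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUC as the Mann-Whitney estimate of P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def segment_features(mz: Sequence[float], intensity: Sequence[float],
                     width: float) -> Dict[int, float]:
    """Collapse one spectrum into m/z segments of fixed width.

    Assumption: each segment is summarized by its maximum intensity, so a
    peak that shifts slightly in m/z still lands in the same feature.
    """
    segments: Dict[int, float] = {}
    for m, y in zip(mz, intensity):
        key = int(m // width)
        segments[key] = max(segments.get(key, 0.0), y)
    return segments


def rank_segments(spectra: List[Dict[int, float]], labels: Sequence[int],
                  top_k: int) -> List[Tuple[int, float]]:
    """Rank segments by how far their AUC is from 0.5 (labels are 0 or 1)."""
    keys = sorted(set().union(*spectra))
    scored = []
    for key in keys:
        pos = [s.get(key, 0.0) for s, l in zip(spectra, labels) if l == 1]
        neg = [s.get(key, 0.0) for s, l in zip(spectra, labels) if l == 0]
        scored.append((key, auc(pos, neg)))
    # An AUC near 0 separates the classes as well as an AUC near 1.
    scored.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return scored[:top_k]


# Toy data (invented): two "cancer" and two "healthy" spectra.
raw = [
    ([1000.2, 1000.9, 2000.1], [5.0, 6.0, 1.0]),
    ([1001.5, 2000.4], [7.0, 2.0]),
    ([1000.4, 2000.3], [1.0, 1.5]),
    ([1002.0, 2001.0], [2.0, 1.0]),
]
toy_labels = [1, 1, 0, 0]
toy_spectra = [segment_features(mz, y, width=5.0) for mz, y in raw]
top_segments = rank_segments(toy_spectra, toy_labels, top_k=1)
```

In the toy data the peak near m/z 1000 drifts between 1000.2 and 1002.0 across spectra, yet every occurrence falls into the same segment, which is the motivation the abstract gives for using segments rather than single m/z positions. The top-ranked segments would then be passed to a classifier such as an SVM.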
References
Alpaydin, E. (2004). Introduction to Machine Learning. The MIT Press.
Adam, B.L., Qu, Y., Davis, J.W., Ward, M.D., Clements, M.A., Cazares, L.H., Semmes, O.J., Schellhammer, P.F., Yasui, Y., Feng, Z., and Wright, G.L. Jr. (2002). Serum Protein Fingerprinting Coupled with a Pattern-matching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research 62(13), 3609-3614.
Baggerly, K.A., Morris, J.S., and Coombes, K.R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 20(5), 777-785.
Cortes, C. and Mohri, M. (2003). AUC Optimization vs. Error Rate Minimization. Advances in Neural Information Processing Systems 15.
Conrads, T.P., Zhou, M., Petricoin, E.F., Liotta, L., and Veenstra, T.D. (2003). Cancer diagnosis using proteomic patterns. Expert Rev Mol Diagn 3(4), 411-420.
Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J., and Vapnik, V. (1996). Support Vector Regression Machines. Neural Information Processing Systems 9, 155-161.
Green, D.M. and Swets, J.A. (1966). Signal Detection Theory and Psychophysics. John Wiley & Sons, New York.
Hanley, J.A. and McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29-36.
Hutchens, T. and Yip, T. (1993). New desorption strategies for the mass spectrometric analysis of macromolecules. Rapid Communications in Mass Spectrometry 7, 576-580.
Coombes, K.R., Koomen, J.M., Baggerly, K.A., Morris, J.S., and Kobayashi, R. (2005). Understanding the characteristics of mass spectrometry data through the use of simulation. Cancer Informatics 1(1), 41-52.
Li, J., Zhang, Z., Rosenzweig, J., Wang, Y.Y., and Chan, D.W. (2002). Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clinical Chemistry 48, 1296-1304.
Lilien, R.H., Farid, H., and Donald, B.R. (2003). Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Journal of Computational Biology 10(6), 925-946.
Lyons-Weiler, J., Pelikan, R., Zeh, H.J., Whitcomb, D.C., Malehorn, D.E., Bigbee, W.L., and Hauskrecht, M. (2005). Assessing the Statistical Significance of the Achieved Classification Error of Classifiers Constructed using Serum Peptide Profiles, and a Prescription for Random Sampling Repeated Studies for Massive High-Throughput Genomic and Proteomic Studies. Cancer Informatics 1(1), 53-77.
Pontil, M., Rifkin, R., and Evgeniou, T. (1999). From Regression to Classification in Support Vector Machines. European Symposium on Artificial Neural Networks.
Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., and Liotta, L.A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572-577.
Qu, Y., Adam, B.L., Thornquist, M., Potter, J.D., Thompson, M.L., Yasui, Y., Davis, J., Schellhammer, P.F., Cazares, L., Clements, M.A., Wright, G.L. Jr., and Feng, Z. (2003). Data Reduction Using a Discrete Wavelet Transform in Discriminant Analysis of Very High Dimensionality Data. Biometrics 59, 143-151.
Reddy, G. and Dalmasso, E.A. (2003). SELDI ProteinChip Array Technology: Protein-Based Predictive Medicine and Drug Discovery Applications. Journal of Biomedicine and Biotechnology 4, 237-241.
Tang, N., Tornatore, P., and Weinberger, S.R. (2004). Current developments in SELDI affinity technology. Mass Spec. Rev. 23, 34-44.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag.
Description	Master's thesis; National Chengchi University (國立政治大學); Graduate Institute of Statistics (統計研究所); 93354016; 94
Source	http://thesis.lib.nccu.edu.tw/record/#G0093354016
Type	thesis

dc.contributor.advisor	張源俊; 郭訓志	zh_TW
dc.contributor.author (Authors)	葉勝宗	zh_TW
dc.creator (Author)	葉勝宗	zh_TW
dc.date (Date)	2005	en_US
dc.date.accessioned	2009-09-14	-
dc.date.available	2009-09-14	-
dc.date.issued (Upload date)	2009-09-14	-
dc.identifier (Other Identifiers)	G0093354016	en_US
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/30902	-
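dc.description (Description)	Master's thesis	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Graduate Institute of Statistics (統計研究所)	zh_TW
dc.description (Description)	93354016	zh_TW
dc.description (Description)	94	zh_TW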
dc.description.tableofcontents	Abstract; Acknowledgments; 1. Introduction; 2. Description of Data; 2.1. Surface-Enhanced Laser Desorption/Ionization Time-of-Flight (SELDI-TOF); 2.2. Samples; 2.3. Data Preprocessing; 3. Literature Review; 4. Methodology; 4.1. Dimension reduction; 4.1.1. The Area Under the Receiver-Operating Characteristic curve (AUC); 4.2. Classification; 4.2.1. Support Vector Machine; 4.2.2. The separable case; 4.2.3. The Non-separable case (Soft Margin Hyperplane); 5. Data Analysis; 5.1. Preprocessing steps; 5.2. Analyses and Result; 5.2.1. Selection of training data; 5.2.2. Segmentation of features; 5.2.3. Ranking segmentation based on AUC; 5.2.4. Pairwise classification based on top ranked segmentations; 5.2.5. 4-class classification; 5.2.6. Sensitivity and Specificity; 6. Conclusion and Future Works; 6.1. Conclusion; 6.2. Future Works; Reference; Appendices	zh_TW
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0093354016	en_US
dc.subject (Keywords)	surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS)	zh_TW
dc.subject (Keywords)	area under the ROC curve	zh_TW
dc.subject (Keywords)	support vector machine	zh_TW
dc.subject (Keywords)	AUC	en_US
dc.subject (Keywords)	feature selection	en_US
dc.subject (Keywords)	classification	en_US
dc.subject (Keywords)	segmentation	en_US
dc.subject (Keywords)	SELDI	en_US
dc.subject (Keywords)	SVM	en_US
dc.title (Title)	使用AUC特徵選取方法在蛋白質質譜儀資料分類之應用	zh_TW
dc.title (Title)	An AUC criterion for feature selection on classifying proteomic spectra data	en_US
dc.type (Type)	thesis	en
