Predict MS2 spectrum based on protein sequence by Deep Convolutional Neural Networks

Title	Predict MS2 spectrum based on protein sequence by Deep Convolutional Neural Networks
Author	Lin, Yang-Ming (林洋名)
Contributors	Chang, Jia-Ming (張家銘); Lin, Yang-Ming (林洋名)
Keywords	Peptide; Deep convolutional neural network; Deep learning; Machine learning; Mass spectrum; Tandem mass spectrometry
Date	2018
Upload time	1-Oct-2018 12:11:12 (UTC+8)
Abstract	Mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. Inside the instrument, a sample is ionized, its peptides and fragments are separated, and a detector records them, producing a spectrum of signal intensity against mass-to-charge ratio. Tandem mass spectrometry (MS2) provides a tool to match signal observations with the chemical process: a peak in an MS2 spectrum indicates the presence of a peptide fragment ion with a specific mass and charge. A predictor of MS2 peak intensity is therefore useful, because it could make identification and quantification more accurate. In this thesis, we propose MS2CNN, a regression model based on a deep convolutional neural network. MS2CNN is trained on the National Institute of Standards and Technology (NIST) MS2 spectrum dataset and evaluated on a publicly available independent test dataset of human HeLa cell lysate from an LC-MS experiment; the independent data take no part in training. On this dataset, MS2CNN achieves a cosine similarity (COS) between 0.57 and 0.79 for 2+ peptides and between 0.59 and 0.74 for 3+ peptides. In five-fold cross-validation, the COS on the training, validation, and test sets is 0.93, 0.86, and 0.83, and the Pearson correlation coefficient (PCC) is 0.91, 0.83, and 0.79, respectively. On the independent test set, our model shows better COS and PCC (0.69 and 0.64) than MS2PIP (0.66 and 0.61). MS2CNN outperforms MS2PIP especially on short peptides (sequence length less than 19). The results suggest that incorporating more data could improve the deep learning model's performance on longer peptides.
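The two metrics reported in the abstract, cosine similarity (COS) and the Pearson correlation coefficient (PCC), can be illustrated with a minimal sketch. This is not the thesis's evaluation code, which is not part of this record; the function names and the example intensity vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(observed, predicted):
    """Cosine of the angle between two equal-length intensity vectors."""
    if len(observed) != len(predicted):
        raise ValueError("vectors must have the same length")
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_o * norm_p)

def pearson_correlation(observed, predicted):
    """Pearson correlation, computed as the cosine similarity of the
    mean-centred vectors."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    centred_o = [o - mean_o for o in observed]
    centred_p = [p - mean_p for p in predicted]
    return cosine_similarity(centred_o, centred_p)

# Hypothetical normalised peak intensities for one peptide (made-up values).
observed = [0.0, 0.2, 1.0, 0.4, 0.1]
predicted = [0.1, 0.3, 0.8, 0.5, 0.0]
print(round(cosine_similarity(observed, predicted), 3))
print(round(pearson_correlation(observed, predicted), 3))
```

COS depends only on the relative peak pattern, so rescaling every predicted intensity by the same positive factor leaves it unchanged. PCC additionally subtracts each vector's mean before the comparison.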
Description	Master's thesis, National Chengchi University (國立政治大學), Department of Computer Science, 105753032
Source	http://thesis.lib.nccu.edu.tw/record/#G0105753032
Type	thesis

dc.contributor.advisor	張家銘	zh_TW
dc.contributor.advisor	Chang, Jia-Ming	en_US
dc.contributor.author (Authors)	林洋名	zh_TW
dc.contributor.author (Authors)	Lin, Yang-Ming	en_US
dc.creator (Author)	林洋名	zh_TW
dc.creator (Author)	Lin, Yang-Ming	en_US
dc.date (Date)	2018	en_US
dc.date.accessioned	1-Oct-2018 12:11:12 (UTC+8)	-
dc.date.available	1-Oct-2018 12:11:12 (UTC+8)	-
dc.date.issued (Upload time)	1-Oct-2018 12:11:12 (UTC+8)	-
dc.identifier (Other Identifiers)	G0105753032	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/120261	-
dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	資訊科學系 (Department of Computer Science)	zh_TW
dc.description (Description)	105753032	zh_TW
dc.description.tableofcontents	Abstract; 摘要 (Chinese abstract); Contents; List of Figures; List of Tables; Introduction; 1.1 Background; 1.1.1 Amino Acid, Peptide, Protein; 1.1.2 Peptide Fragmentation Nomenclature; 1.1.3 Mass spectrometry (MS); 1.1.4 Tandem mass spectrometry (MS2); 1.2 Identification tool strategies; 1.2.1 Database search approach; 1.2.2 Data driven approach; Related Works; 2.1 PeptideART; 2.2 MS2PIP; 2.3 Deep learning approach; Methods; 3.1 Dataset; 3.1.1 Training data set; 3.1.2 Independent test set; 3.2 Data processing; 3.2.1 De-duplicated spectrum; 3.2.2 Feature engineering; 3.3 MS2CNN Model; Evaluation; 4.1 K-fold cross validation; 4.2 Metrics; 4.3 Evaluation methods; Results and Discussion; 5.1 5-fold cross validation for determining convolutional layer; 5.2 MS2CNN training result; 5.3 Independent data set evaluation; 5.4 Similarity with Training data and Independent set; Conclusion and Future work; Reference	zh_TW
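The outline lists K-fold cross validation (section 4.1) and a 5-fold run (section 5.1). The sketch below shows one way such splits can be generated. It is an illustration only: this record does not say how the thesis assigned folds, so the rotation used here (one fold for testing, the next for validation, the rest for training) and the name `k_fold_splits` are assumptions.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, validation, test) index lists for each of k rotations.

    Assumption: fold i is the test set, the following fold is the
    validation set, and the remaining k - 2 folds form the training set.
    """
    if k < 3 or n_samples < k:
        raise ValueError("need k >= 3 and at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        held_out = (i, (i + 1) % k)
        test = folds[held_out[0]]
        validation = folds[held_out[1]]
        train = [idx for j, fold in enumerate(folds)
                 if j not in held_out for idx in fold]
        yield train, validation, test

# Usage: ten hypothetical spectra split five ways.
for train, validation, test in k_fold_splits(10):
    print(len(train), len(validation), len(test))
```

Each sample appears in exactly one test fold across the k rotations, so the per-fold scores can be averaged without counting any spectrum twice.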
dc.format.extent	3687431 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0105753032	en_US
dc.subject (Keywords)	胜肽 (Peptide)	zh_TW
dc.subject (Keywords)	卷積類神經網路 (Convolutional neural network)	zh_TW
dc.subject (Keywords)	質譜儀圖譜 (Mass spectrum)	zh_TW
dc.subject (Keywords)	深度學習 (Deep learning)	zh_TW
dc.subject (Keywords)	機器學習 (Machine learning)	zh_TW
dc.subject (Keywords)	質譜儀 (Mass spectrometer)	zh_TW
dc.subject (Keywords)	Peptide	en_US
dc.subject (Keywords)	Deep convolutional neural network	en_US
dc.subject (Keywords)	Deep learning	en_US
dc.subject (Keywords)	Machine learning	en_US
dc.subject (Keywords)	Mass spectrum	en_US
dc.subject (Keywords)	Tandem mass spectrometry	en_US
dc.title (Title)	深度學習, 卷積神經網路模型, 預測蛋白質序列質譜儀圖譜	zh_TW
dc.title (Title)	Predict MS2 spectrum based on protein sequence by Deep Convolutional Neural Networks	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/THE.NCCU.CS.013.2018.B02	en_US
