Title 基於資料科學方法之巨量蛋白質功能預測
Applying Data Science to High-throughput Protein Function Prediction
Author 劉義瑋
Liu, Yi-Wei
Contributors 廖文宏
Liao, Wen-Hung
劉義瑋
Liu, Yi-Wei
Keywords Protein function prediction
Machine learning
Date 2017
Uploaded 2-Oct-2017 10:16:27 (UTC+8)
Abstract Since the completion of the Human Genome Project and the advent of next-generation sequencing, biological data have grown explosively, and protein sequences are among the gene products being discovered in large numbers. Determining and annotating protein function through wet-lab experiments is extremely time-consuming, so the functions of many proteins with known sequences remain unknown. Predicting likely functions computationally before running experiments helps biologists formulate hypotheses and prioritize experiments, which speeds up protein function annotation. Gene Ontology (GO) is a widely used classification for describing the functions and properties of gene products. It is divided into three domains: Biological Process, Cellular Component, and Molecular Function. Each domain is a hierarchical tree composed of labels known as GO terms. Protein function prediction takes a protein sequence and predicts its GO terms, so it can be treated as a multi-label classification problem. We propose a machine learning framework that predicts protein function from sequence homology and can also incorporate protein family information, and we design several voting mechanisms to resolve the multi-label prediction problem.
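The multi-label voting idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the cosine similarity, the similarity-weighted vote, the normalization by total neighbor similarity, the function names, and the toy feature vectors are all assumptions made for illustration. The thesis itself builds its features with TFPSSM and defines its own voting weights.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_go_terms(query, training, k=3):
    """Score GO terms for `query` by similarity-weighted voting.

    `training` is a list of (feature_vector, set_of_go_terms) pairs.
    Each of the k most similar training proteins votes for all of its
    GO terms with a weight equal to its similarity (assumed scheme).
    Scores are divided by the total weight so they fall in [0, 1].
    """
    neighbors = sorted(
        ((max(cosine(query, vec), 0.0), terms) for vec, terms in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors)
    if total <= 0:
        return {}
    votes = defaultdict(float)
    for sim, terms in neighbors:
        for term in terms:
            votes[term] += sim
    return {term: score / total for term, score in votes.items()}


# Toy data: hypothetical 2-dimensional features, GO IDs used only as labels.
training = [
    ([1.0, 0.0], {"GO:0003677", "GO:0005634"}),
    ([0.9, 0.1], {"GO:0003677"}),
    ([0.0, 1.0], {"GO:0016020"}),
]
scores = predict_go_terms([1.0, 0.0], training, k=2)
```

Because a protein can carry several GO terms, every neighbor votes for its full label set rather than a single class, and a threshold on the normalized score then decides which terms to report.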
References [1] Christophe Dessimoz and Nives Škunca. The Gene Ontology Handbook. Springer, 2016.
[2] Predrag Radivojac, Wyatt T Clark, Tal Ronnen Oron, Alexandra M Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher Funk, Karin Verspoor, Asa Ben-Hur, et al. A large-scale evaluation of computational protein function prediction. Nature Methods, 10(3):221–227, 2013.
[3] Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D’Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology, 17(1):184, 2016.
[4] Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu. PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins: Structure, Function, and Bioinformatics, 72(2):693–710, 2008.
[5] Jia-Ming Chang, Jean-Francois Taly, Ionas Erb, Ting-Yi Sung, Wen-Lian Hsu, Chuan Yi Tang, Cedric Notredame, and Emily Chia-Yu Su. Efficient and interpretable prediction of protein functional classes by correspondence analysis and compact set relations. PLoS ONE, 8(10):e75542, 2013.
[6] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
[7] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[8] Ian Sillitoe, Tony E Lewis, Alison Cuff, Sayoni Das, Paul Ashford, Natalie L Dawson, Nicholas Furnham, Roman A Laskowski, David Lee, Jonathan G Lees, et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research, 43(D1):D376–D381, 2015.
[9] Christine A Orengo, AD Michie, S Jones, David T Jones, MB Swindells, and Janet M Thornton. CATH – a hierarchic classification of protein domain structures. Structure, 5(8):1093–1109, 1997.
[10] Sayoni Das, David Lee, Ian Sillitoe, Natalie L Dawson, Jonathan G Lees, and Christine A Orengo. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics, 31(21):3460–3467, 2015.
[11] Sayoni Das, Ian Sillitoe, David Lee, Jonathan G Lees, Natalie L Dawson, John Ward, and Christine A Orengo. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Research, 43(W1):W148–W153, 2015.
[12] Chin-Sheng Yu, Chih-Jen Lin, and Jenn-Kang Hwang. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Science, 13(5):1402–1406, 2004.
[13] Keun-Joon Park and Minoru Kanehisa. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13):1656–1663, 2003.
[14] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001.
[15] Yuxiang Jiang. CAFA2: MATLAB evaluation codes for the 2nd CAFA experiment. https://github.com/yuxjiang/CAFA2, 2016.
[16] Robert C Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460–2461, 2010.
Description Master's
National Chengchi University
Department of Computer Science
104753013
Source http://thesis.lib.nccu.edu.tw/record/#G0104753013
Type thesis
dc.contributor.advisor 廖文宏 zh_TW
dc.contributor.advisor Liao, Wen-Hung en_US
dc.contributor.author (Author) 劉義瑋 zh_TW
dc.contributor.author (Author) Liu, Yi-Wei en_US
dc.creator (Author) 劉義瑋 zh_TW
dc.creator (Author) Liu, Yi-Wei en_US
dc.date (Date) 2017 en_US
dc.date.accessioned 2-Oct-2017 10:16:27 (UTC+8) -
dc.date.available 2-Oct-2017 10:16:27 (UTC+8) -
dc.date.issued (Uploaded) 2-Oct-2017 10:16:27 (UTC+8) -
dc.identifier (Other identifier) G0104753013 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/113294 -
dc.description (Description) Master's zh_TW
dc.description (Description) National Chengchi University zh_TW
dc.description (Description) Department of Computer Science zh_TW
dc.description (Description) 104753013 zh_TW
dc.description.tableofcontents 1 Introduction
1.1 Background
1.1.1 Proteins
1.1.2 Gene Ontology
1.1.3 Protein function prediction problem
1.1.4 The CAFA challenge
1.2 Objective
1.3 Our contributions
2 Related Work
2.1 Protein function annotation transferred from homologous proteins
2.2 Protein function annotation transferred from protein families
3 Methods
3.1 Feature representation by TFPSSM
3.1.1 Gapped-dipeptide
3.1.2 Position-specific scoring matrix
3.1.3 TFPSSM weighting scheme
3.2 Feature reduction by Principal Component Analysis
3.3 CATH information
3.4 Gene Ontology prediction by K-nearest neighbor algorithm and weighted voting
3.4.1 TFPSSM 1NN
3.4.2 TFPSSM Vote
3.4.2.1 Three branches of TFPSSM to determine K
3.4.2.2 Three voting weights to predict GO terms
3.4.3 TFPSSM CATH
3.4.4 Normalization of weighted voting
3.5 System architecture
4 Evaluation
4.1 Data sets
4.2 Five-fold cross-validation
4.3 Evaluation measures
4.4 Baseline models
4.4.1 Naïve method
4.4.2 BLAST method
4.5 Experiment design
4.5.1 Experiment 1: PCA
4.5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting
4.5.3 Experiment 3: TFPSSM CATH
4.5.4 Experiment 4: Testing
5 Results and Discussion
5.1 Experiment 1: PCA
5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting
5.2.1 Fixed-KNN
5.2.2 Dynamic-KNN
5.2.3 Hybrid-KNN
5.3 Experiment 3: TFPSSM-CATH
5.4 Summary of experiment results
5.4.1 CAFA2-Swiss and CAFA3-Swiss training dataset five-fold validation
5.4.2 CAFA2-Benchmark testing
5.4.3 CAFA3 preliminary results
6 Conclusion and Future Work
References
dc.format.extent 4396161 bytes -
dc.format.mimetype application/pdf -
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0104753013 en_US
dc.subject (Keywords) Protein function prediction en_US
dc.subject (Keywords) Machine learning en_US
dc.title (Title) 基於資料科學方法之巨量蛋白質功能預測 zh_TW
dc.title (Title) Applying Data Science to High-throughput Protein Function Prediction en_US
dc.type (Type) thesis en_US