BigBigTree: a divide and concatenate strategy for the phylogenetic reconstruction of large orthologous datasets using Nextflow framework

Tsai, Han-Lung (蔡漢龍)

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/131632

Title:	BigBigTree: a divide and concatenate strategy for the phylogenetic reconstruction of large orthologous datasets using Nextflow framework
Author:	Tsai, Han-Lung (蔡漢龍)
Contributors:	Chang, Jia-Ming (張家銘); Tsai, Han-Lung (蔡漢龍)
Keywords:	Gene tree; phylogenetic tree; Nextflow; divide and concatenate
Date:	2020
Upload date:	2-Sep-2020
Abstract:	A phylogenetic tree is a branching diagram that classifies organisms by their similarities in morphology, structure, physiology, ecology, genetics, and gene sequence, and depicts the evolutionary relationships among them. It represents the inferred evolutionary history of the sequences. With the development of next-generation and third-generation sequencing, ever more sequence data have become available, and the sheer volume challenges even the fastest methods. For some important multigene families, such as the olfactory receptors, it has become infeasible to build a phylogenetic tree with the most accurate methods, such as maximum likelihood (ML).

In this study we present BigBigTree, a simple divide and concatenate strategy that breaks the problem down into smaller problems that are solved independently. The approach relies on the ability to identify clusters of orthologous genes within a large dataset. A phylogenetic tree is built for each cluster of orthologous genes using a typical approach. The upper level of the tree (the super-tree) is resolved in a second stage: one protein per species is chosen from each subtree, and all chosen proteins from the same species are aligned together in a multiple sequence alignment. The alignment used to build the super-tree results from concatenating these per-species alignments according to orthology, so that within-species paralogues appear in the same columns and orthologues appear in the same row. The advantage is that the number of sequences to classify is reduced without losing information, because all sequences are represented in the final alignment. The approach deals efficiently with lineage-specific duplications but cannot handle lateral gene transfer, so it is better suited to the analysis of large eukaryotic families such as the kinases or the olfactory receptors.

We evaluated BigBigTree on simulated and real datasets against RAxML v8.2.12, RAxML-NG, and IQ-TREE 2. In most cases BigBigTree runs faster than RAxML and RAxML-NG. In terms of topological accuracy, BigBigTree outperforms the other methods on simulated data and achieves comparable accuracy on real data. The source code and a Docker container are available at https://github.com/jmchanglabtw/bigbigtree and https://hub.docker.com/r/changlabtw/bigbigtree; the latter allows one-click installation.
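The concatenation step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the BigBigTree source code: the function name `concatenate_by_cluster`, the nested-dictionary input layout, and the choice to pad a missing species with gap characters are all assumptions made for illustration. The sketch assumes the per-species alignments have already been computed by an external aligner, and only shows how they could be laid side by side so that each row corresponds to one orthologous cluster and each block of columns to one species.

```python
from typing import Dict


def concatenate_by_cluster(
    per_species_alignments: Dict[str, Dict[str, str]],
    gap: str = "-",
) -> Dict[str, str]:
    """Build one concatenated row per orthologous cluster.

    per_species_alignments maps species -> {cluster id -> gapped sequence},
    where all sequences of one species come from the same alignment and
    therefore have the same length. (Hypothetical layout, for illustration.)
    """
    species_order = sorted(per_species_alignments)
    clusters = sorted(
        {c for aln in per_species_alignments.values() for c in aln}
    )
    rows = {c: [] for c in clusters}

    for species in species_order:
        aln = per_species_alignments[species]
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences of {species} are not aligned")
        width = widths.pop()
        for cluster in clusters:
            # Assumption: a cluster with no representative in this species
            # is padded with gaps so that all rows keep the same length.
            rows[cluster].append(aln.get(cluster, gap * width))

    return {cluster: "".join(parts) for cluster, parts in rows.items()}


# Toy input: three species, two orthologous clusters.
example = {
    "human": {"cluster1": "MKT-A", "cluster2": "MRTQA"},
    "mouse": {"cluster1": "MK-S", "cluster2": "MKAS"},
    "fly": {"cluster2": "MQ"},
}
supermatrix = concatenate_by_cluster(example)
for cluster_id, row in supermatrix.items():
    print(cluster_id, row)
```

In this layout the leaves of the super-tree are the clusters, so the number of sequences passed to the tree builder equals the number of clusters rather than the number of genes.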
Description:	Master's thesis, Department of Computer Science, National Chengchi University; 107753006
Source:	http://thesis.lib.nccu.edu.tw/record/#G0107753006
Type:	thesis
Appears in Collections:	Theses (學位論文)

Files in This Item:

File	Description	Size	Format
300601.pdf		3.41 MB	Adobe PDF
