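Title BigBigTree2: 基於Nextflow中DSL2語法改進BigBigTree的大規模基因樹建構方法 (BigBigTree2: a large-scale gene tree construction method that improves BigBigTree using Nextflow DSL2 syntax)
BigBigTree2: Advanced Large-Scale Gene Tree Construction, Improving BigBigTree with Nextflow DSL2 Integration
Author 邱顯安
Chiu, Hsien-An
Contributors 張家銘
邱顯安
Chiu, Hsien-An
Keywords Large-scale gene phylogenetic tree
Multigene family
Divide and concatenate
Phylogenetic placement
Nextflow
Date 2024
Upload time 4-Sep-2024 14:58:44 (UTC+8)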
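Abstract BigBigTree, presented in the 2020 master's thesis of 蔡漢龍, builds phylogenetic trees by clustering on top of the Nextflow framework. It targets multigene families (for example, Drosophila olfactory receptors). For data of this kind, the traditional and most accurate approach, maximum likelihood, appears unable to reconstruct trees accurately when computational resources are limited.
In this study we propose BigBigTree2. Because Nextflow fully removed support for DSL1 syntax after 2022, BigBigTree2 had to be rewritten entirely in DSL2. DSL2 also makes BigBigTree2 more extensible, so new features can be added and maintained quickly and easily. In addition, we add one step to the original tree-building workflow, phylogenetic placement, which addresses BigBigTree's inability to handle input sequences with low identity scores. Although this step increases run time, it computes, for each such sequence, the position that gives the tree the highest likelihood score and places the sequence there. BigBigTree2 is thus a large-scale gene tree construction pipeline that uses a divide-and-concatenate approach and achieves parallel computation through the Nextflow architecture.
Taking unaligned sequences as input, BigBigTree2 runs BLAST, sequence alignment, and phylogenetic placement, and outputs the final tree. Compared with current mainstream tree-building methods, it is about twice as fast on roughly 6,000 sequences and nearly three times as fast on 10,000 sequences while reaching similar likelihood scores, and the advantage becomes more pronounced as the number of sequences grows.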
References Park, M.; Zaharias, P.; Warnow, T. Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. Algorithms 2021, 14.
Smirnov V, Warnow T. Unblended disjoint tree merging using GTM improves species tree estimation. BMC Genomics. 2020 Apr.
蔡漢龍. BigBigTree: a divide and concatenate strategy for the phylogenetic reconstruction of large orthologous datasets using Nextflow framework. [Unpublished master's thesis]. Department of Computer Science, National Chengchi University (2020).
iTOL - https://itol.embl.de/
Kannan, L., Wheeler, W.C. Maximum Parsimony on Phylogenetic networks. Algorithms Mol Biol 7, 9 (2012).
Roychoudhury, Arindam. "Consistency of the Maximum Likelihood Estimator of Evolutionary Tree." arXiv: Populations and Evolution (2014).
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981.
Notredame C, Higgins DG, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217.
Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002 Jul.
Sievers F, Higgins DG. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 2018 Jan;27(1).
Zaharias, Paul and Warnow, Tandy. Recent progress on methods for estimating and updating large phylogenies. 2022 Phil. Trans. R. Soc. B 377: 20210244.
Difference of Orthology and Paralogy - http://petang.cgu.edu.tw/Bioinfomatics/MANUALS/NCBIblast/Orthology.html
BLAST - https://blast.ncbi.nlm.nih.gov/Blast.cgi
hcluster - https://pypi.python.org/pypi/hcluster
TreeBeST - http://treesoft.sourceforge.net/treebest.shtml
Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Systematic Biology, 59(3):307-21, 2010.
Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319.
Ewels, P.A., Peltzer, A., Fillinger, S. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38, 276–278 (2020).
Björn E. Langer, Andreia Amaral, Marie-Odile Baudement, et al., the nf-core community. Empowering bioinformatics communities with Nextflow and nf-core, bioRxiv 2024.05.10.59291
Elizabeth Koning, Malachi Phillips, and Tandy Warnow. pplacerDC: a new scalable phylogenetic placement method. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB '21).
Matsen, F.A., Kodner, R.B. & Armbrust, E. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010).
Alexandros Stamatakis, RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies, Bioinformatics, Volume 30, Issue 9, May 2014, Pages 1312–1313.
Berger, S. A., Krompass, D., & Stamatakis, A. (2011). Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology, 60(3), 291-302.
Pierre Barbera, Alexey M Kozlov, Lucas Czech, Benoit Morel, Diego Darriba, Tomáš Flouri, Alexandros Stamatakis 2019; EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences, Systematic Biology, syy054.
Metin Balaban and others, APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments, Systematic Biology, Volume 69, Issue 3, May 2020, Pages 566–578.
Treedist - https://pypi.org/project/treedist/
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci 1981.
Bui Quang Minh, Heiko A Schmidt, Olga Chernomor, Dominik Schrempf, Michael D Woodhams, Arndt von Haeseler, Robert Lanfear, IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era, Molecular Biology and Evolution, Volume 37, Issue 5, May 2020.
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009.
Tata Consultancy Services (2024). TCS ADD™ - Advanced Drug Development Suite. Retrieved from TCS Official Website, https://www.tcs.com
Emms, D.M. and Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019.
Altenhoff, A.M., Train, C.M., Seluanov, A., & Dessimoz, C. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Research. 2021.
Description Master's
National Chengchi University
Department of Computer Science
110753110
Source http://thesis.lib.nccu.edu.tw/record/#G0110753110
Type thesis
dc.contributor.advisor 張家銘 zh_TW
dc.contributor.author (Authors) 邱顯安 zh_TW
dc.contributor.author (Authors) Chiu, Hsien-An en_US
dc.creator (Author) 邱顯安 zh_TW
dc.creator (Author) Chiu, Hsien-An en_US
dc.date (Date) 2024 en_US
dc.date.accessioned 4-Sep-2024 14:58:44 (UTC+8)
dc.date.available 4-Sep-2024 14:58:44 (UTC+8)
dc.date.issued (Upload time) 4-Sep-2024 14:58:44 (UTC+8)
dc.identifier (Other Identifiers) G0110753110 en_US
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/153373
dc.description (Description) Master's zh_TW
dc.description (Description) National Chengchi University zh_TW
dc.description (Description) Department of Computer Science zh_TW
dc.description (Description) 110753110 zh_TW
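dc.description.tableofcontents zh_TW
Chapter 1 Introduction
1.1 Large-scale phylogenetic trees
1.2 Challenges of building phylogenetic trees for massive sets of homologous genes
1.3 BigBigTree
1.4 Nextflow
1.5 Differences between DSL1 and DSL2
1.6 The low-identity sequence problem
1.7 Literature review of phylogenetic placement methods
1.8 BigBigTree2
Chapter 2 Methods
2.1 Overview
2.2 Approach to the low-identity sequence problem
2.2.1 Tool selection in BigBigTree2
2.2.2 Detailed phylogenetic placement workflow in BigBigTree2
2.3 Datasets
2.3.1 Drosophila olfactory receptor gene dataset
2.3.2 PTHR24416 gene family dataset
2.4 Tools
Chapter 3 Results
3.1 Inconsistent results caused by the TreeBeST tree-building method
3.2 Confirming that Nextflow DSL1 and DSL2 give consistent experimental results
3.3 Effect of the distribution of homology clusters on the tree
3.4 BigBigTree2 results
3.4.1 Results for PTHR24416 datasets of different sizes
3.4.2 Comparison of BigBigTree2 with other tree-building methods
Chapter 4 Discussion
4.1 Handling of long concatenated sequences
4.2 Risks of low-identity sequences
4.3 Characteristics of homologous genes
Chapter 5 Conclusion
References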
dc.format.extent 11016191 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0110753110 en_US
dc.subject (Keywords) Large-scale gene phylogenetic tree zh_TW
dc.subject (Keywords) Multigene family zh_TW
dc.subject (Keywords) Divide and concatenate zh_TW
dc.subject (Keywords) Phylogenetic placement zh_TW
dc.subject (Keywords) Nextflow en_US
dc.title (Title) BigBigTree2: 基於Nextflow中DSL2語法改進BigBigTree的大規模基因樹建構方法 (BigBigTree2: a large-scale gene tree construction method that improves BigBigTree using Nextflow DSL2 syntax) zh_TW
dc.title (Title) BigBigTree2: Advanced Large-Scale Gene Tree Construction, Improving BigBigTree with Nextflow DSL2 Integration en_US
dc.type (Type) thesis en_US