Title 基於Hi-C數據的病毒基因組組裝與宿主關聯分析方法改進
Improvement of viral gene assembly and host analysis methods based on Hi-C data
Author 李佳芬
Li, Jia-Fen
Contributors 張家銘 (advisor)
Chang, Jia-Ming (advisor)
李佳芬
Li, Jia-Fen
Keywords Hi-C
Binning
Microbial genomes
Clustering algorithms
Graph neural networks
Virus–host interactions
Date 2025
Upload time 1-Sep-2025 16:56:58 (UTC+8)
Abstract Deciphering virus–host linkages is pivotal in microbiome research, and Hi-C proximity ligation enables inference of these associations from physical contacts between DNA fragments. ViralCC and MetaCC are representative recent Hi-C binning tools, targeting viruses and bacteria (or other prokaryotes), respectively; they can assemble genomes, predict virus–host pairs, and reconstruct microbial genomes from Hi-C interaction matrices. On environmental samples, however, these tools still face challenges, including incomplete assemblies, high misbinning rates, elevated genome contamination, and limited computational efficiency and scalability, so the Hi-C binning workflow needs further optimization and extension. This study optimizes ViralCC and MetaCC and presents an improved Hi-C microbial binning framework that centers on the Hi-C interaction matrix while integrating multiple genomic features (such as GC content, contig length, and Hi-C contact strength). The method couples parameter-tuned community detection (Leiden and Louvain) with a new dynamic clustering algorithm that combines graph-structural analysis with a graph neural network (GNN): the GNN learns the structure and features of the graph and produces embeddings for embedding-based clustering, with the aim of improving the completeness of reconstructed microbial genomes and reducing contamination. The optimized method is validated by comparison with existing methods, including ViralCC and MetaCC. In the experiments, ViralCC generated 525 pure viral bins but could not handle host contigs, whereas MetaCC produced 211 bins, of which 158 (about 74.9%) were mixed virus–host bins, indicating that its clustering strategy conflates viral and host contigs. The proposed method separated viral and host contigs, binning 88,792 contigs with no mixed bins, thereby improving bin purity and reliability and strengthening downstream virus–host pairing. Keywords: Hi-C, binning, microbial genomes, clustering algorithms, graph neural networks, virus–host interactions
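The mixed-bin comparison reported in the abstract can be illustrated with a minimal sketch. It assumes that each contig is assigned to exactly one bin and that a set of viral contig names is available from an upstream viral identification step; the function names and the toy data below are hypothetical and are not taken from the thesis code.

```python
from collections import defaultdict


def classify_bins(contig_to_bin, viral_contigs):
    """Label each bin 'viral', 'host', or 'mixed' from its member contigs.

    contig_to_bin: dict mapping contig name -> bin identifier (assumed input).
    viral_contigs: set of contig names flagged as viral (assumed input).
    """
    members = defaultdict(set)
    for contig, bin_id in contig_to_bin.items():
        members[bin_id].add(contig)

    labels = {}
    for bin_id, contigs in members.items():
        n_viral = len(contigs & viral_contigs)
        if n_viral == len(contigs):
            labels[bin_id] = "viral"   # every contig in the bin is viral
        elif n_viral == 0:
            labels[bin_id] = "host"    # no viral contig in the bin
        else:
            labels[bin_id] = "mixed"   # viral and host contigs share the bin
    return labels


def mixed_fraction(labels):
    """Fraction of bins labelled 'mixed' (0.0 when there are no bins)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels.values() if label == "mixed") / len(labels)


# Toy data: bin b2 holds one viral contig (c3) and one host contig (c4).
contig_to_bin = {"c1": "b1", "c2": "b1", "c3": "b2", "c4": "b2", "c5": "b3"}
viral_contigs = {"c1", "c2", "c3"}

labels = classify_bins(contig_to_bin, viral_contigs)
print(labels)
print(mixed_fraction(labels))

# The percentage quoted for MetaCC follows from the reported counts.
print(round(100 * 158 / 211, 1))
```

Under this definition a bin counts as mixed as soon as it contains at least one viral and one host contig; a stricter or more lenient threshold would change the reported fraction.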
References
[1] Du, Y., Fuhrman, J. A., & Sun, F. (2023). ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data. Nat Commun 14, 502.
[2] Du, Y., & Sun, F. (2023). MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data. Nat Commun 14, 6231.
[3] Integra Biosciences, "Short Read vs. Long Read Sequencing," Integra Biosciences. [Online]. Available: https://www.integra-biosciences.com/global/en/blog/article/short-read-vs-long-read-sequencing
[4] dyxstat, "MetaCC: Scalable and Integrative Analyses of MetaHi-C Data," GitHub repository, 2023. [Online]. Available: https://github.com/dyxstat/MetaCC
[5] Yoon, S. H., Ha, S. M., Lim, J., Kwon, S., & Chun, J. (2017). A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek, 110(10), 1281-1286. doi: 10.1007/s10482-017-0844-4. PMID: 28204908.
[6] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233. https://doi.org/10.1038/s41598-019-41695-z
[7] Wikipedia contributors. (2024). Contig. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Contig
[8] Wikipedia contributors. (2024). Adapter. Wikipedia. Retrieved from https://zh.wikipedia.org/zh-tw/%E8%A1%94%E6%8E%A5%E5%AD%90
[9] EMBnet. (2014). The contig: a concept in genome assembly. EMBnet.journal, 20(1), 20. https://journal.embnet.org/index.php/embnetjournal/article/view/200
[10] Chklovski, A. (2023). CheckM2: An enhanced framework for assessing genome quality using machine learning. GitHub repository. Retrieved from https://github.com/chklovski/CheckM2
[11] Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
[12] ScienceDirect. (2024). Adjusted Rand Index. Retrieved from https://www.sciencedirect.com/topics/computer-science/adjusted-rand-index
[13] Wikipedia contributors. (2024). Rand Index. Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Rand_index
[14] National Center for Biotechnology Information, "Sequence Read Archive," NCBI. [Online]. Available: https://www.ncbi.nlm.nih.gov/sra
[15] Simroux, "VirSorter: mining viral signal from microbial genomic data," GitHub repository, https://github.com/simroux/VirSorter (accessed Jul. 13, 2025).
Description Master's
National Chengchi University
Department of Computer Science
112753103
Source http://thesis.lib.nccu.edu.tw/record/#G0112753103
Type thesis
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/159412
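dc.description.tableofcontents
Chapter 1 Introduction
  Section 1 Research Background
  Section 2 Research Objectives and Conceptual Framework
    1. Research Objectives
    2. Conceptual Framework
Chapter 2 Related Work
  Section 1 Terminology for Common Analysis Methods
    1. Contig
    2. Adapter Sequences
    3. Marker Genes
  Section 2 Binning and Clustering
    1. Concepts of and Differences between Bins and Clusters
    2. Applying Graph Clustering Methods: Louvain and Leiden
  Section 3 Hi-C Binning Tools
    1. ViralCC (Du et al., 2023)
    2. MetaCC (Du & Sun, 2023)
    3. Applications of ViralCC and MetaCC to Genome Binning
Chapter 3 Methods
  Section 1 Overview
  Section 2 Datasets
    1. Data Sources
    2. Data Processing Workflow
  Section 3 Experimental Design
    1. Overall Architecture
    2. Hi-C Graph Construction and Augmentation
    3. Choice of Algorithms and Techniques
    4. Evaluation of Genome Assembly Accuracy
Chapter 4 Code and Folder Structure
  Section 1 Main Program and Related Modules
    1. Main Program (main.py)
    2. data_loader.py
    3. graph_utils.py
    4. clustering.py
    5. result_exporter.py
  Section 2 System Processing Workflow and Architecture
    1. Input Data Formats
    2. Main Workflow Steps (main.py)
Chapter 5 Results
  Section 1 Experimental Results
    1. Analysis of Contig and Binning Results
    2. System Runtime Performance (Computation Time and Resource Consumption)
  Section 2 Analysis Results
    1. Constructing the Hi-C Graph
    2. Effect of the Improved Method on Assembly
    3. Clustering Consistency Comparison (Adjusted Rand Index)
  Section 3 Comparison Results
    1. Evaluation of Genome Completeness and Contamination
    2. Comparison with ViralCC / MetaCC Results
    3. Analysis and Statistics of Clustering Results after Filtering High-Contamination Prokaryotic Bins
Chapter 6 Discussion and Future Directions
  1. Research Contributions
  2. Discussion of the Comparison with MetaCC
  3. Future Research Directions
Chapter 7 References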
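The table of contents lists a clustering consistency comparison based on the Adjusted Rand Index (ARI). As a rough illustration of what that metric measures, the sketch below computes the standard pair-counting ARI between two cluster assignments of the same contigs using only the Python standard library. The function name and the toy label lists are hypothetical; how the thesis itself computes ARI is not stated in this record.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting Adjusted Rand Index between two partitions.

    labels_a, labels_b: equal-length sequences; position i holds the
    cluster label of item i under each clustering.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts and marginal cluster sizes.
    joint = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in sizes_a.values())
    sum_b = sum(comb(c, 2) for c in sizes_b.values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = (sum_a + sum_b) / 2           # agreement of identical partitions
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (sum_joint - expected) / (maximum - expected)


# Same grouping under different label names -> perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
# Every pair grouped differently -> worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

ARI depends only on which items are grouped together, not on the label names, which is why it suits comparing bin assignments produced by different tools.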
dc.format.extent 1539143 bytes
dc.format.mimetype application/pdf