Improvement of viral gene assembly and host analysis methods based on Hi-C data

Title	基於Hi-C數據的病毒基因組組裝與宿主關聯分析方法改進 (Improvement of viral gene assembly and host analysis methods based on Hi-C data)
Author	李佳芬 Li, Jia-Fen
Contributors	張家銘 Chang, Jia-Ming (advisor); 李佳芬 Li, Jia-Fen (author)
Keywords	Hi-C; Binning; Microbial genomes; Clustering algorithms; Graph neural networks; Virus–host interactions
Date	2025
Upload time	1-Sep-2025 16:56:58 (UTC+8)
Abstract	Deciphering virus–host linkages is pivotal in microbiome research, and Hi-C proximity ligation enables inference of these associations from physical contacts between DNA fragments. Two representative recent Hi-C binning tools, ViralCC and MetaCC, target viruses and bacteria (or other prokaryotes) respectively; from Hi-C interaction matrices they can assemble genomes, predict virus–host pairs, and reconstruct microbial genomes. Environmental samples still pose challenges for these tools, including fragmented assemblies, high misbinning rates, elevated genome contamination, and limited computational efficiency and scalability. We present an improved Hi-C microbial binning framework that centers on the Hi-C interaction matrix while integrating several genomic features (such as GC content, contig length, and Hi-C contact strength). The method couples parameter-tuned community detection (Leiden and Louvain) with a dynamic clustering algorithm that combines graph-structural analysis and a graph neural network (GNN); the GNN learns graph structure and node features to produce embeddings for embedding-based clustering, with the aim of raising genome completeness and reducing contamination. In benchmarking, ViralCC generated 525 pure viral bins but could not handle host contigs, whereas MetaCC produced 211 bins, of which 158 (about 74.9%) were mixed virus–host bins, indicating that its clustering strategy conflates viral and host contigs. Our approach separated viral and host contigs, binning 88,792 contigs with no mixed bins, thereby improving bin purity and reliability and strengthening downstream virus–host pairing. Keywords: Hi-C, binning, microbial genomes, clustering algorithms, graph neural networks, virus–host interactions
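The mixed-bin proportion quoted in the abstract can be illustrated with a minimal sketch. It assumes that every contig carries a single "viral" or "host" label (the abstract does not say how labels were assigned) and that a bin counts as mixed when it holds contigs of both kinds. The function names `summarize_bins` and `mixed_fraction`, and the toy contig and bin identifiers, are hypothetical and are not taken from the thesis code.

```python
from collections import defaultdict


def summarize_bins(bin_of, kind_of):
    """Count bins that are purely viral, purely host, or mixed.

    bin_of  maps contig id -> bin id
    kind_of maps contig id -> "viral" or "host" (assumed labels)
    """
    kinds_in_bin = defaultdict(set)
    for contig, bin_id in bin_of.items():
        kinds_in_bin[bin_id].add(kind_of[contig])

    summary = {"viral": 0, "host": 0, "mixed": 0}
    for kinds in kinds_in_bin.values():
        if len(kinds) > 1:
            summary["mixed"] += 1          # both viral and host contigs present
        else:
            summary[next(iter(kinds))] += 1  # single-kind (pure) bin
    return summary


def mixed_fraction(summary):
    """Fraction of all bins that are mixed (0.0 when there are no bins)."""
    total = sum(summary.values())
    return summary["mixed"] / total if total else 0.0


# Toy data: hypothetical contigs and bins, not results from the thesis.
bin_of = {"c1": "b1", "c2": "b1", "c3": "b2", "c4": "b3", "c5": "b3"}
kind_of = {"c1": "viral", "c2": "host", "c3": "viral", "c4": "host", "c5": "host"}

toy = summarize_bins(bin_of, kind_of)
print(toy)                    # {'viral': 1, 'host': 1, 'mixed': 1}

# Arithmetic check of the reported MetaCC figure: 158 mixed bins out of 211.
print(f"{158 / 211:.1%}")     # 74.9%
```

Under this definition, a method reported as producing no mixed bins would yield a mixed fraction of 0.0.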
References
[1] Du, Y., Fuhrman, J. A., & Sun, F. (2023). ViralCC retrieves complete viral genomes and virus-host pairs from metagenomic Hi-C data. Nature Communications, 14, 502.
[2] Du, Y., & Sun, F. (2023). MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data. Nature Communications, 14, 6231.
[3] Integra Biosciences. Short Read vs. Long Read Sequencing. https://www.integra-biosciences.com/global/en/blog/article/short-read-vs-long-read-sequencing
[4] dyxstat. (2023). MetaCC: Scalable and Integrative Analyses of MetaHi-C Data. GitHub repository. https://github.com/dyxstat/MetaCC
[5] Yoon, S. H., Ha, S. M., Lim, J., Kwon, S., & Chun, J. (2017). A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie van Leeuwenhoek, 110(10), 1281–1286. https://doi.org/10.1007/s10482-017-0844-4 (PMID: 28204908)
[6] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233. https://doi.org/10.1038/s41598-019-41695-z
[7] Wikipedia contributors. (2024). Contig. Wikipedia. https://en.wikipedia.org/wiki/Contig
[8] Wikipedia contributors. (2024). Adapter. Wikipedia. https://zh.wikipedia.org/zh-tw/%E8%A1%94%E6%8E%A5%E5%AD%90
[9] EMBnet. (2014). The contig: a concept in genome assembly. EMBnet.journal, 20(1), 20. https://journal.embnet.org/index.php/embnetjournal/article/view/200
[10] Chklovski, A. (2023). CheckM2: An enhanced framework for assessing genome quality using machine learning. GitHub repository. https://github.com/chklovski/CheckM2
[11] Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. https://doi.org/10.1101/gr.186072.114
[12] ScienceDirect. (2024). Adjusted Rand Index. https://www.sciencedirect.com/topics/computer-science/adjusted-rand-index
[13] Wikipedia contributors. (2024). Rand Index. Wikipedia. https://en.wikipedia.org/wiki/Rand_index
[14] National Center for Biotechnology Information. Sequence Read Archive. NCBI. https://www.ncbi.nlm.nih.gov/sra
[15] Simroux. VirSorter: mining viral signal from microbial genomic data. GitHub repository. https://github.com/simroux/VirSorter (accessed Jul. 13, 2025)
Description	Master's thesis; 國立政治大學 (National Chengchi University); 資訊科學系 (Department of Computer Science); 112753103
Source	http://thesis.lib.nccu.edu.tw/record/#G0112753103
Type	thesis

dc.contributor.advisor	張家銘	zh_TW
dc.contributor.advisor	Chang, Jia-Ming	en_US
dc.contributor.author (Authors)	李佳芬	zh_TW
dc.contributor.author (Authors)	Li, Jia-Fen	en_US
dc.creator (Author)	李佳芬	zh_TW
dc.creator (Author)	Li, Jia-Fen	en_US
dc.date (Date)	2025	en_US
dc.date.accessioned	1-Sep-2025 16:56:58 (UTC+8)	-
dc.date.available	1-Sep-2025 16:56:58 (UTC+8)	-
dc.date.issued (Upload time)	1-Sep-2025 16:56:58 (UTC+8)	-
dc.identifier (Other Identifiers)	G0112753103	en_US
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/159412	-
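dc.description (Description)	碩士 (Master's)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	資訊科學系 (Department of Computer Science)	zh_TW
dc.description (Description)	112753103	zh_TW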
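dc.description.tableofcontents
	Chapter 1. Introduction
		Section 1. Research Background
		Section 2. Research Objectives and Conceptual Framework
			1. Research Objectives
			2. Conceptual Framework
	Chapter 2. Related Work
		Section 1. Terminology for Common Analysis Methods
			1. Contigs
			2. Adapter Sequences
			3. Marker Genes
		Section 2. Binning and Clustering
			1. Bins versus Clusters: Concepts and Differences
			2. Graph Clustering in Practice: Louvain and Leiden
		Section 3. Hi-C Binning Tools
			1. ViralCC (Du et al., 2023)
			2. MetaCC (Du & Sun, 2023)
			3. Application of ViralCC and MetaCC to Genome Binning
	Chapter 3. Methods
		Section 1. Overview
		Section 2. Datasets
			1. Data Sources
			2. Data Processing Workflow
		Section 3. Experimental Design
			1. Overall Architecture
			2. Hi-C Graph Construction and Augmentation
			3. Choice of Algorithms and Techniques
			4. Evaluation of Genome Assembly Accuracy
	Chapter 4. Code and Directory Structure
		Section 1. Main Program and Related Modules
			1. Main program (main.py)
			2. data_loader.py
			3. graph_utils.py
			4. clustering.py
			5. result_exporter.py
		Section 2. System Workflow and Architecture
			1. Input Data Formats
			2. Main Workflow Steps (main.py)
	Chapter 5. Results
		Section 1. Experimental Results
			1. Analysis of Contig and Binning Results
			2. System Performance Analysis (Runtime and Resource Consumption)
		Section 2. Analysis Results
			1. Constructing the Hi-C Graph
			2. Effect of the Improved Method on Assembly
			3. Clustering Consistency Comparison (Adjusted Rand Index)
		Section 3. Comparison Results
			1. Evaluation of Genome Completeness and Contamination
			2. Comparison with ViralCC / MetaCC Results
			3. Analysis and Statistics of Results after Filtering Highly Contaminated Prokaryotic Clusters
	Chapter 6. Discussion and Future Directions
			1. Research Contributions
			2. Discussion of the Comparison with MetaCC
			3. Future Research Directions
	Chapter 7. References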
dc.format.extent	1539143 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0112753103	en_US
dc.subject (Keywords)	Hi-C	en_US
dc.subject (Keywords)	Binning	en_US
dc.subject (Keywords)	Microbial genomes	en_US
dc.subject (Keywords)	Clustering algorithms	en_US
dc.subject (Keywords)	Graph neural networks	en_US
dc.subject (Keywords)	Virus–host interactions	en_US
dc.title (Title)	基於Hi-C數據的病毒基因組組裝與宿主關聯分析方法改進	zh_TW
dc.title (Title)	Improvement of viral gene assembly and host analysis methods based on Hi-C data	en_US
dc.type (Type)	thesis	en_US