Title: Decoding the Power of PC1: A Fast and Accurate Covariance-Based Method for A/B Compartment Identification in Hi-C Data (解碼 PC1 的力量:一種快速準確並基於共變異的 Hi-C 資料 A/B 染色體區室辨別方法)
Author: Cheng, Zhi-Rong (程至榮)
Contributors: Chang, Jia-Ming (張家銘); Cheng, Zhi-Rong (程至榮)
Keywords: Hi-C (high-throughput chromosome conformation capture); Chromatin compartments analysis; Principal Component Analysis (PCA)
Date: 2024
Upload time: 4 September 2024 15:00:57 (UTC+8)
Abstract:
PCA is the standard method for identifying A and B compartments in the Hi-C Pearson matrix, yet the reason it works is rarely discussed. We propose that, for the Hi-C Pearson matrix, the explained variance ratio of PC1 is usually high, and that this ratio reflects how well PC1 matches the compartments on the Pearson matrix. We also propose a heuristic algorithm that estimates the pattern of PC1 from the covariance matrix of the Hi-C Pearson matrix without explicitly performing PCA. The method can be implemented efficiently with random sampling to accelerate computation. To address the memory bottleneck at finer matrix resolutions, we adapt the algorithm using principles from POSSUMM, a recently published compartment identification tool that takes the sparse Hi-C O/E matrix as input. In our experiments, our algorithm outperforms power iteration methods, such as those implemented in Scikit-learn and POSSUMM, in time or memory usage, while maintaining a high degree of similarity to the ground-truth PC1. The code is freely available at https://github.com/ZhiRongDev/HiCPEP.
References:
[1] Erez Lieberman-Aiden*, Nynke L. van Berkum*, et al. “Comprehensive mapping of long-range interactions reveals folding principles of the human genome.” Science 326 (2009).
[2] Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002 Feb 15;295(5558):1306-11. doi: 10.1126/science.1067799. PMID: 11847345.
[3] Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S., and Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380.
[4] Rao, S., Huang, S.-C., Glenn St Hilaire, B., Engreitz, J. M., Perez, E. M., et al. (2017). Cohesin loss eliminates all loop domains. Cell 171, 305–320.e24. doi:10.1016/j.cell.2017.09.026
[5] Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014 Dec 18;159(7):1665-80. doi: 10.1016/j.cell.2014.11.021. Epub 2014 Dec 11. Erratum in: Cell. 2015 Jul 30;162(3):687-8. PMID: 25497547; PMCID: PMC5635824.
[6] Harris, H.L., Gu, H., Olshansky, M. et al. Chromatin alternates between A and B compartments at kilobase scale for subgenic organization. Nat Commun 14, 3303 (2023). https://doi.org/10.1038/s41467-023-38429-1
[7] Yaffe, E., and Tanay, A. (2011). Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43 (11), 1059–1065. doi:10.1038/ng.947
[8] Servant, N., Varoquaux, N., Lajoie, B. R., Viara, E., Chen, C. J., Vert, J. P., et al. (2015). HiC-Pro: An optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259. doi:10.1186/s13059-015-0831-x
[9] Imakaev, M., Fudenberg, G., McCord, R. P., Naumova, N., Goloborodko, A., Lajoie, B.R., et al. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9 (10), 999–1003. doi:10.1038/nmeth.2148
[10] Knight, P. A., and Ruiz, D. (2013). A fast algorithm for matrix balancing. IMA J. Numer. Analysis 33 (3), 1029–1047. doi:10.1093/imanum/drs019
[11] Kalluchi A, Harris HL, Reznicek TE, Rowley MJ. Considerations and caveats for analyzing chromatin compartments. Front Mol Biosci. 2023 Apr 5;10:1168562. doi: 10.3389/fmolb.2023.1168562. PMID: 37091873; PMCID: PMC10113542.
[12] Jolliffe, I. T., and Cadima, J. (2016). Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374, 20150202. http://doi.org/10.1098/rsta.2015.0202
[13] Kruse, K., Hug, C.B. & Vaquerizas, J.M. FAN-C: a feature-rich framework for the analysis and visualization of chromosome conformation capture data. Genome Biol 21, 303 (2020). https://doi.org/10.1186/s13059-020-02215-9
[14] Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432
[15] Abdennur, N., and Mirny, L.A. (2020). Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. doi: 10.1093/bioinformatics/btz540.
[16] Neva C. Durand, Muhammad S. Shamim, Ido Machol, Suhas S. P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. “Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments.” Cell Systems 3(1), 2016.
[17] Zheng X, Zheng Y. CscoreTool: fast Hi-C compartment analysis at high resolution. Bioinformatics. 2018 May 1;34(9):1568-1570. doi: 10.1093/bioinformatics/btx802. PMID: 29244056; PMCID: PMC5925784.
[18] Xiong, K., and Ma, J. (2019). Revealing Hi-C subcompartments by imputing interchromosomal chromatin interactions. Nat. Commun. 10 (1), 5069. doi:10.1038/s41467-019-12954-4.
[19] Wen, Z., Zhang, W., Zhong, Q., Xu, J., Hou, C., Qin, Z. S., et al. (2022). Extensive chromatin structure-function associations revealed by accurate 3D compartmentalization characterization. Front. Cell Dev. Biol. 10, 845118. doi:10.3389/fcell.2022.845118
[20] van Berkum NL, Lieberman-Aiden E, Williams L, Imakaev M et al. Hi-C: a method to study the three-dimensional architecture of genomes. J Vis Exp 2010 May 6;(39). PMID: 20461051
[21] Sanborn AL, Rao SS, Huang SC, Durand NC et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc Natl Acad Sci U S A 2015 Nov 24;112(47):E6456-65. PMID: 26499245
[22] Jonathon Shlens. A Tutorial on Principal Component Analysis. 2014. arXiv:1404.1100
[23] Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011. arXiv:1201.0490
[24] Baglama, J. & Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19–42 (2005). https://doi.org/10.1137/04060593X
[25] Free Software Foundation, Inc. (2014). GNU Datamash. Retrieved from https://www.gnu.org/software/datamash/
Description: Master's
National Chengchi University
Department of Computer Science
111753151
Source: http://thesis.lib.nccu.edu.tw/record/#G0111753151
Type: thesis
dc.contributor.advisor: Chang, Jia-Ming (張家銘)
dc.date.accessioned: 4 September 2024 15:00:57 (UTC+8)
dc.date.available: 4 September 2024 15:00:57 (UTC+8)
dc.date.issued (upload time): 4 September 2024 15:00:57 (UTC+8)
dc.identifier (other identifier): G0111753151
dc.identifier.uri (URI): https://nccur.lib.nccu.edu.tw/handle/140.119/153385
dc.description.tableofcontents: 1 Introduction; 2 Materials and Methods; 3 Results; 4 Conclusion; 5 Supplemental Information; Reference
dc.format.extent: 5459159 bytes
dc.format.mimetype: application/pdf
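The two ideas in the abstract, a high explained variance ratio for PC1 on the Hi-C Pearson matrix and an estimate of the PC1 pattern taken from the covariance matrix without running PCA, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the O/E matrix is synthetic, the function names are hypothetical, and the selection rule in `estimate_pc1` (take the covariance column with the largest absolute sum) is one plausible heuristic that the abstract does not confirm. It should not be read as the HiCPEP implementation.

```python
import numpy as np


def pearson_matrix(oe):
    """Pearson correlation between the rows of an O/E matrix."""
    return np.corrcoef(oe)


def pca_pc1(pearson):
    """Return PC1, its explained variance ratio, and the covariance matrix."""
    centered = pearson - pearson.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (pearson.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum(), cov


def estimate_pc1(cov):
    """ASSUMED heuristic: pick the covariance column with the largest absolute sum."""
    return cov[:, np.argmax(np.abs(cov).sum(axis=0))]


def sign_agreement(a, b):
    """Fraction of bins with matching sign, ignoring a global sign flip."""
    same = np.mean(np.sign(a) == np.sign(b))
    return max(same, 1.0 - same)


# Synthetic checkerboard O/E matrix: alternating blocks of +1 / -1 bins.
rng = np.random.default_rng(0)
n, block = 60, 10
s = np.where((np.arange(n) // block) % 2 == 0, 1.0, -1.0)
noise = rng.normal(size=(n, n))
noise = (noise + noise.T) / 2  # keep the matrix symmetric
oe = 1.0 + 0.5 * np.outer(s, s) + 0.05 * noise

pearson = pearson_matrix(oe)
pc1, ratio, cov = pca_pc1(pearson)
est_pc1 = estimate_pc1(cov)
agreement = sign_agreement(pc1, est_pc1)

print("explained variance ratio of PC1:", ratio)
print("sign agreement between PC1 and the estimate:", agreement)
```

On a clean checkerboard like this one, the covariance matrix is close to rank one, so any of its columns already carries the sign pattern of PC1. Real Hi-C data is noisier, which is presumably why a selection rule and the sampling and sparse-input refinements mentioned in the abstract matter.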