Title: Improve the identification of Copy Number Variation using Smoothing Strategy and Incorporating Control HiC Data
Original title: 利用平滑化處理與參照控制HiC資料來優化找尋基因體拷貝數變異
Author: Chen, Wei-Han (陳韋翰)
Contributors: Chang, Jia-Ming (張家銘); Chen, Wei-Han (陳韋翰)
Keywords: Copy Number Variation; HiC; Whole Genome Sequencing
Date: 2022
Uploaded: 2-Sep-2022 15:05:31 (UTC+8)
Abstract: Copy number variation (CNV) is common in abnormal cells such as cancer cells. Detecting CNV in these cell lines is crucial when working with sequencing data, because removing the resulting bias makes downstream analysis more accurate. CNV also shows up in HiC data, so HiC can be used to identify CNV, and HiNT is the state-of-the-art method for doing so. However, the normalization step of HiNT exhibits a fluctuation phenomenon. In this work, we aim to reduce the fluctuation and further improve the performance of HiNT by adding a smoothing procedure (a mean filter) and by using HiC data from a control cell line in the normalization step. As a result, we achieve a higher Spearman correlation coefficient (0.868 vs. 0.837), more consistent CNV segments, higher precision (0.800 vs. 0.750), and higher recall (0.324 vs. 0.243). In addition, using only intra-chromosomal HiC information makes the run about ten times faster (6 minutes vs. 1 hour) with only a slight loss in accuracy.
Description: Master's thesis
National Chengchi University
Department of Computer Science
109753144
Source: http://thesis.lib.nccu.edu.tw/record/#G0109753144
Type: thesis
Advisor: Chang, Jia-Ming (張家銘)
Identifier: G0109753144
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/141641
DOI: 10.6814/NCCU202201407
Format: application/pdf, 6227369 bytes
Table of contents:
1. Introduction
1.1. Next-Generation Sequencing Technique
1.2. Copy Number Variation (CNV)
1.3. Identification of CNVs
1.3.1. Identification of CNVs by WGS
1.3.2. Identification of CNVs by HiC
1.3.3. Identification of CNVs by ChIP-seq
2. Related Works
2.1. BICseq2
2.2. HiNT
3. Methods
3.1. Experiment Overview
3.2. Experiment Materials
3.3. Replication of HiNT-CNV Workflow
3.4. Calculation of Log2 Copy Ratio
3.5. Ground Truth Generation
3.6. Metrics
3.7. Insufficiency of HiNT-CNV
3.8. Improvement Strategies
3.8.1. Smoothing Procedure
3.8.2. Incorporating Control HiC
3.8.3. Using Only Intra-chromosomal HiC
4. Results
4.1. Ground Truth
4.2. Smoothing Procedure
4.3. Incorporating Control HiC
4.4. Using Only Intra-chromosomal HiC
4.5. Combination of Smoothing Procedure and Incorporating Control HiC
5. Conclusion and Discussion
6. References
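The abstract describes the smoothing procedure only as a mean filter applied to reduce fluctuation in the normalization step of HiNT. The following is a minimal sketch of a generic mean filter over a one-dimensional signal, not the thesis implementation: the function name `mean_filter`, the `window` parameter, the edge-padding choice, and the toy values are all assumptions made for illustration.

```python
def mean_filter(signal, window=5):
    """Smooth a 1-D signal with a centered moving average (mean filter).

    `window` is a hypothetical odd window size. Edges are padded by
    repeating the boundary values so the output keeps the input length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    values = [float(v) for v in signal]
    if not values:
        return []
    half = window // 2
    # Repeat the first and last values so every position has a full window.
    padded = [values[0]] * half + values + [values[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(values))]


# Toy per-bin values with one isolated spike (made-up numbers, not thesis data).
ratios = [0.0, 0.0, 0.9, 0.0, 0.0, 1.0, 1.0, 1.0]
smoothed = mean_filter(ratios, window=3)
```

A wider window damps isolated spikes more strongly but also blurs genuine breakpoints between segments, so the window size would need tuning against the bin resolution in any real use.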