Title 利用平滑化處理與參照控制HiC資料來優化找尋基因體拷貝數變異
Improve the identification of Copy Number Variation using Smoothing Strategy and Incorporating Control HiC Data
Author 陳韋翰
Chen, Wei-Han
Contributors 張家銘
Chang, Jia-Ming
陳韋翰
Chen, Wei-Han
Keywords Copy Number Variation
HiC (high-throughput chromosome conformation capture)
Whole Genome Sequencing
Date 2022
Upload time 2-Sep-2022 15:05:31 (UTC+8)
Abstract Copy number variation (CNV) is common in abnormal cells such as tumor cells. Detecting CNVs in such cell lines matters for sequencing analysis, because removing the resulting sequence-related bias makes downstream analysis more accurate. CNVs also leave a signal in HiC data, so HiC can be used to identify them, and HiNT is the state-of-the-art method for doing so. However, the normalization step of HiNT exhibits a fluctuation phenomenon. In this work, we reduce the fluctuation and further improve the performance of HiNT by adding a smoothing procedure (a mean filter) and by using HiC data from a control cell line in the normalization step. As a result, we achieve a higher Spearman correlation coefficient (0.868 vs. 0.837), more consistent CNV segments with more CNVs successfully predicted, higher precision (0.800 vs. 0.750), and higher recall (0.324 vs. 0.243). In addition, using only intra-chromosomal HiC data makes the run about ten times faster (6 minutes vs. 1 hour) at the cost of a slight loss in accuracy.
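The two ideas named in the abstract, mean-filter smoothing and normalization against a control, can be illustrated with a minimal sketch. This is not the thesis code or HiNT's implementation. It assumes the signal has already been reduced to one coverage value per genomic bin, and the function names, the window size, the pseudocount, and the coverage values are all hypothetical choices made for illustration.

```python
import math
from statistics import fmean


def mean_filter(signal, window=5):
    """Replace each bin with the mean of a centered window (truncated at the edges)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        fmean(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def log2_copy_ratio(sample, control, pseudocount=1.0):
    """Per-bin log2(sample / control); the pseudocount (an assumption) avoids division by zero."""
    if len(sample) != len(control):
        raise ValueError("sample and control must cover the same bins")
    return [
        math.log2((s + pseudocount) / (c + pseudocount))
        for s, c in zip(sample, control)
    ]


# Hypothetical per-bin coverage: a noisy flat control and a sample with a gain in bins 4-7.
control = [100, 104, 96, 101, 99, 103, 97, 100, 102, 98]
sample = [98, 105, 95, 102, 151, 147, 155, 149, 101, 97]

# Smooth both tracks first, then compare the sample against the control bin by bin.
ratio = log2_copy_ratio(mean_filter(sample, 3), mean_filter(control, 3))
print([round(r, 2) for r in ratio])
```

In this toy input the bins inside the gained region come out clearly positive while the flanking bins stay near zero. A wider window suppresses more bin-to-bin fluctuation but also blurs the boundaries of short segments, which is the usual trade-off of a mean filter.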
References 1. Rui Yin, Chee Keong Kwoh, Jie Zheng, Whole Genome Sequencing Analysis, Editor(s): Shoba Ranganathan, Michael Gribskov, Kenta Nakai, Christian Schönbach, Encyclopedia of Bioinformatics and Computational Biology, Academic Press, 2019, Pages 176-183.
2. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009 Oct 9;326(5950):289-93.
3. Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, Parrinello H, Tanay A, Cavalli G. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell. 2012 Feb 3;148(3):458-72.
4. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007 Jun 8;316(5830):1497-502.
5. Ashoor H, Louis-Brennetot C, Janoueix-Lerosey I, Bajic VB, Boeva V. HMCan-diff: a method to detect changes in histone modifications in cells with different genetic characteristics. Nucleic Acids Res. 2017 May 5;45(8):e58.
6. Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics. 2011 Jan 15;27(2):268-9.
7. Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Barillot E. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012 Feb 1;28(3):423-5.
8. Abyzov A, Urban AE, Snyder M, Gerstein M "CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing." Genome Res. 2011 Jun;21(6):974-84.
9. Milovan Suvakov, Arijit Panda, Colin Diesh, Ian Holmes, Alexej Abyzov, CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing, GigaScience, 2021 Nov;10(11):giab074.
10. Xi R, Lee S, Xia Y, Kim TM, Park PJ. Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic Acids Res. 2016 Jul 27;44(13):6274-86.
11. Harewood L, Kishore K, Eldridge MD, Wingett S, Pearson D, Schoenfelder S, Collins VP, Fraser P. Hi-C as a tool for precise detection and characterisation of chromosomal rearrangements and copy number variation in human tumors. Genome Biol. 2017 Jun 27;18(1):125.
12. Chakraborty A, Ay F. Identification of copy number variations and translocations in cancer cells from Hi-C data. Bioinformatics. 2018 Jan 15;34(2):338-345.
13. Vidal E, le Dily F, Quilez J, Stadhouders R, Cuartero Y, Graf T, Marti-Renom MA, Beato M, Filion GJ. OneD: increasing reproducibility of HiC samples with abnormal karyotypes. Nucleic Acids Res. 2018 May 4;46(8):e49.
14. Khalil AIS, Muzaki SRBM, Chattopadhyay A, Sanyal A. Identification and utilization of copy number information for correcting Hi-C contact map of cancer cell lines. BMC Bioinformatics. 2020 Nov 7;21(1):506.
15. Wang, S., Lee, S., Chu, C. et al. HiNT: a computational method for detecting copy number variations and translocations from Hi-C data. Genome Biol 21, 73 (2020).
16. Rao SS, Huntley MH, Durand NC, Stamenova EK et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014 Dec 18;159(7):1665-80.
17. Razin SV, Gavrilov AA. Structural-Functional Domains of the Eukaryotic Genome. Biochemistry (Mosc). 2018 Apr;83(4):302-312.
18. John D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007 May;9(3):90–95.
19. Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, Aiden EL. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016 Jul;3(1):99-101.
20. Wood, S. N. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B). 2011;73(1):3–36.
21. Arce, Gonzalo R. Nonlinear Signal Processing: A Statistical Approach. New Jersey, USA: Wiley. 2004 Nov. ISBN 0-471-67624-1.
Description Master's degree
National Chengchi University
Department of Computer Science
109753144
Source http://thesis.lib.nccu.edu.tw/record/#G0109753144
Type thesis
dc.contributor.advisor 張家銘 [zh_TW]
dc.contributor.advisor Chang, Jia-Ming [en_US]
dc.identifier (Other Identifiers) G0109753144 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141641
dc.description.tableofcontents 1. Introduction
1.1. Next-Generation Sequencing Technique
1.2. Copy Number Variation (CNV)
1.3. Identification of CNVs
1.3.1. Identification of CNVs by WGS
1.3.2. Identification of CNVs by HiC
1.3.3. Identification of CNVs by ChIP-seq
2. Related Works
2.1. BICseq2
2.2. HiNT
3. Methods
3.1. Experiment Overview
3.2. Experiment Materials
3.3. Replication of HiNT-CNV Workflow
3.4. Calculation of Log2 Copy Ratio
3.5. Ground Truth Generation
3.6. Metrics
3.7. Insufficiency of HiNT-CNV
3.8. Improvement Strategies
3.8.1. Smoothing Procedure
3.8.2. Incorporating Control HiC
3.8.3. Using Only Intra-chromosomal HiC
4. Results
4.1. Ground Truth
4.2. Smoothing Procedure
4.3. Incorporating Control HiC
4.4. Using Only Intra-chromosomal HiC
4.5. Combination of Smoothing Procedure and Incorporating Control HiC
5. Conclusion and Discussion
6. References
dc.format.extent 6227369 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202201407 [en_US]