Title: GBactPro: General Bacterial Promoter prediction across species using machine learning (GBactPro:基於機器學習方法對細菌啟動子進行跨物種預測)
Author: Kao, Yu-Chien (高語謙)
Contributors: Chang, Jia-Ming (張家銘), advisor; Kao, Yu-Chien (高語謙), author
Keywords: Bacterial promoters; Machine learning; Random forest; Long short-term memory (LSTM)
Date: 2024
Uploaded: 4-Sep-2024 14:59:32 (UTC+8)
Abstract:
Promoters are specific DNA segments upstream of the transcription start site (TSS) that play an essential role in regulating transcription. Although many promoter prediction tools exist, most focus on a small number of species, especially E. coli. We developed GBactPro by combining Promotech's cross-species prediction concept with the promoter scanning model developed by Professor Hsin-Hung David Chou of National Taiwan University. GBactPro uses the scanning model to generate promoter data, then trains random forest and deep learning models to identify the sequence features of each region. By drawing on information from adjacent regions, the random forest model learns more sequence features and is more accurate than the scanning model, which only calculates sequence binding energy. The deep learning models use 1D-CNN and LSTM layers. Because LSTM can learn long-distance features, these models can predict whether a long sequence contains a promoter without preprocessing the input with the scanning model. Our model achieves better cross-species prediction results than Promotech. In addition, region-specific cross-species predictions with GBactPro give results in the -10 and -35 regions that agree with the high sequence conservation known from biology. Finally, the model performs worse on species with distinctive sequence characteristics, such as Alphaproteobacteria, in which T occurs less frequently at position -7; this is consistent with the known biological sequence features.
Description: Master's thesis
National Chengchi University
Department of Computer Science
111753130
Source: http://thesis.lib.nccu.edu.tw/record/#G0111753130
Type: thesis
Identifier: G0111753130
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/153377
Format: application/pdf, 2504637 bytes
Table of contents:
Chapter 1 Introduction
1.1. Bacterial promoters
1.2. Promoter prediction tools
1.3. The promoter prediction problem
Chapter 2 Methods
2.1. Overview
2.2. Promoter scanning model
2.3. Datasets
2.3.1. Positive samples
2.3.2. Negative samples
2.4. One-hot encoding
2.5. Random forest model
2.6. Deep learning models
2.7. Performance evaluation
2.8. Experimental environment
Chapter 3 Results
3.1. Comparison with Promotech results
3.2. Predicting whether longer sequences contain a promoter
3.3. Random forest hyperparameter selection
3.4. Region-specific cross-species prediction results
3.5. Comparing the random forest model with the scanning model
3.6. Comparing the random forest model with the deep learning models
3.7. Prediction results for the Dis region
Chapter 4 Discussion
Chapter 5 Conclusion
References
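The table of contents in this record lists one-hot encoding among the methods, and the abstract describes predicting whether a long sequence contains a promoter. The Python sketch below illustrates those two ideas only in general terms; it is not the thesis code. The function names `one_hot` and `sliding_windows`, the A/C/G/T column order, the all-zero row for unknown bases, and the window width and step are all assumptions made for illustration, since this record does not state them.

```python
# Illustrative sketch only: names, column order, and window settings are
# assumptions, not details taken from the thesis.

BASES = "ACGT"  # assumed column order for the one-hot vectors


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (one row per base).

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def sliding_windows(seq, width, step=1):
    """Yield (start, subsequence) pairs for fixed-width windows over seq.

    Yields nothing when seq is shorter than width.
    """
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


if __name__ == "__main__":
    # Hypothetical input sequence, chosen only to show the output shape.
    sequence = "TTGACANTATAAT"
    for start, window in sliding_windows(sequence, width=6, step=3):
        print(start, window, one_hot(window))
```

Each window's encoded rows could then be flattened into a feature vector for a classifier such as a random forest, or kept as a (length, 4) matrix for a 1D-CNN or LSTM; which of these the thesis actually does is not specified in this record.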