Title A Biclustering Approach to Missing-Value Imputation Based on the PLAID Algorithm (基於 Plaid 演算法的雙向分群缺失值插補方法)
Author 林詠盛 (Lin, Yung-Sheng)
Contributors 吳漢銘 (Wu, Han-Ming), advisor
林詠盛 (Lin, Yung-Sheng), author
Keywords Missing data imputation
Biclustering
PLAID algorithm
Date 2025
Uploaded 4-Aug-2025 15:12:22 (UTC+8)
Abstract Missing value imputation is a critical step in data analysis, especially in bioinformatics, where datasets frequently contain missing entries that can undermine the validity of results. Current imputation methods, such as multiple imputation and k-nearest neighbors (KNN), have notable limitations. Multiple imputation depends on strong and often untestable stochastic assumptions, while KNN performs poorly on high-dimensional data. To address these challenges, we propose a new imputation framework based on the PLAID biclustering algorithm. PLAID detects overlapping patterns and block structures in the data, capturing the localized co-variation and functional modules commonly found in gene expression and clinical datasets. By using these structures to guide imputation, our method handles missing data in a biologically coherent and context-aware way. We compare our approach with existing methods through simulation studies and real-world data analyses. The results show that leveraging biclustering structures leads to more accurate and more biologically meaningful imputation than conventional techniques.
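The idea of letting a bicluster guide imputation can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the bicluster (a set of row and column indices) has already been found, for example by a PLAID fit, and it estimates an additive layer of the form mu + a_i + b_j (the layer structure of the plaid model in Lazzeroni and Owen, 2002) from the observed cells using simple means rather than a full PLAID estimation. The function name impute_block and the toy matrix are hypothetical.

```python
from statistics import mean


def impute_block(X, rows, cols):
    """Fill missing cells (None) inside one bicluster with mu + a_i + b_j.

    Assumption: `rows` and `cols` describe a bicluster that was found
    elsewhere; this sketch does not run PLAID itself.
    """
    out = [row[:] for row in X]  # work on a copy, leave X untouched
    observed = [X[i][j] for i in rows for j in cols if X[i][j] is not None]
    if not observed:
        return out  # nothing to learn from, so nothing is imputed

    mu = mean(observed)  # overall level of the block

    def effect(values):
        # Offset of a row or column from the block level (0 if unobserved).
        return mean(values) - mu if values else 0.0

    row_eff = {i: effect([X[i][j] for j in cols if X[i][j] is not None])
               for i in rows}
    col_eff = {j: effect([X[i][j] for i in rows if X[i][j] is not None])
               for j in cols}

    for i in rows:
        for j in cols:
            if out[i][j] is None:
                out[i][j] = mu + row_eff[i] + col_eff[j]
    return out


# Toy data: the top-left 3x3 block is exactly additive (10 + a_i + b_j),
# and the remaining cells belong to no bicluster.
X_demo = [
    [7, 9, 11, 100],
    [8, None, 12, 100],
    [9, 11, 13, 100],
    [100, 100, 100, None],
]
filled = impute_block(X_demo, rows=[0, 1, 2], cols=[0, 1, 2])
```

In this toy case the missing cell inside the block is recovered as 10, while the missing cell outside the block is left untouched. A global mean over all observed cells would instead be pulled toward the unrelated values of 100, which is the kind of distortion that local structure is meant to avoid.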
References Aittokallio, T. (2010). Dealing with missing values in large-scale studies: Microarray data imputation and beyond. Briefings in Bioinformatics, 11(2), 253–264.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. https://doi.org/10.1002/mpr.329
Andrews, T. S., & Hemberg, M. (2019). False signals induced by single-cell imputation. F1000Research, 7, 1740. https://doi.org/10.12688/f1000research.16613.2
Bishop, C. M. (1999). Variational principal components. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99 (Conf. Publ. No. 470) (Vol. 1, pp. 509–514). IET.
Jadhav, A., Pramod, D., & Ramanathan, K. (2019). Comparison of performance of data imputation methods for numeric dataset. Applied Artificial Intelligence, 33(10), 913–933.
Jin, L., Bi, Y., Hu, C., Qu, J., Shen, S., Wang, X., & Tian, Y. (2021). A comparative study of evaluating missing value imputation methods in label-free proteomics. Scientific Reports, 11(1), 1760.
Lazzeroni, L., & Owen, A. (2002). Plaid models for gene expression data. Statistica Sinica, 12, 61–86.
Liew, A. W.-C., Law, N.-F., & Yan, H. (2011). Missing value imputation for gene expression data: Computational techniques to recover missing data from available information. Briefings in Bioinformatics, 12(5), 498–513.
Liao, S. G., Lin, Y., Kang, D. D., Chandra, D., Bon, J., Kaminski, N., & Tseng, G. C. (2014). Missing value imputation in high-dimensional phenomic data: Imputable or not, and how? BMC Bioinformatics, 15(1), 1–12.
Oba, S., Sato, M. A., Takemasa, I., Monden, M., Matsubara, K. I., & Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16), 2088–2096.
Stacklies, W., Redestig, H., Scholz, M., Walther, D., & Selbig, J. (2007). pcaMethods: A Bioconductor package providing PCA methods for incomplete data. Bioinformatics, 23(9), 1164–1167. https://doi.org/10.1093/bioinformatics/btm069
Schmitt, P., Mandel, J., & Guedj, M. (2015). A comparison of six methods for missing data imputation. Journal of Biometrics & Biostatistics, 6(1), 1.
Samad, M., Kowsar, I., Rabbani, S., & Hou, Y. (2024). Deepifsac: Deep imputation of missing values using feature and sample attention within contrastive framework. Available at SSRN 5137008.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525. https://doi.org/10.1093/bioinformatics/17.6.520
Turner, H., Bailey, T., & Krzanowski, W. (2005). Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics & Data Analysis, 48(2), 235–254.
Van Buuren, S., & Oudshoorn, K. (1999). Flexible multivariate imputation by MICE (Tech. Rep.). TNO Report, TNO.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
Yang, Y., Xu, Z., & Song, D. (2016). Missing value imputation for microRNA expression data by using a GO-based similarity measure. BMC Bioinformatics, 17(Suppl 17), 109–116. https://doi.org/10.1186/s12859-016-1275-2
Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biology, 18(1), 174.
Description Master's
National Chengchi University
Department of Statistics
112354029
Source http://thesis.lib.nccu.edu.tw/record/#G0112354029
Type thesis
dc.contributor.advisor 吳漢銘 (Wu, Han-Ming)
dc.contributor.author (Authors) 林詠盛 (Lin, Yung-Sheng)
dc.creator (Author) 林詠盛 (Lin, Yung-Sheng)
dc.date (Date) 2025
dc.date.accessioned 4-Aug-2025 15:12:22 (UTC+8)
dc.date.available 4-Aug-2025 15:12:22 (UTC+8)
dc.date.issued (Uploaded) 4-Aug-2025 15:12:22 (UTC+8)
dc.identifier (Other Identifiers) G0112354029
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/158718
dc.description (Description) Master's
dc.description (Description) National Chengchi University
dc.description (Description) Department of Statistics
dc.description (Description) 112354029
dc.description.tableofcontents
Acknowledgements (in Chinese)
Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Aims and Objectives
  1.4 Significance of the Study
  1.5 Organization of the Thesis
2 Some Existing Missing Values Imputation Methods
  2.1 Mean Imputation
  2.2 Median Imputation
  2.3 K-Nearest Neighbors (KNN)
  2.4 Singular Value Decomposition (SVD)
  2.5 Bayesian Principal Component Analysis (BPCA)
  2.6 Multiple Imputation by Chained Equations (MICE)
  2.7 Evaluation Metrics
3 The PLAID Model Biclustering Method
  3.1 Biclustering Overview
  3.2 Mathematical Representation and Objective Function
  3.3 Parameter Estimation
  3.4 Sequential Layer Clustering and Stopping Criteria
4 Missing Values Imputation based on PLAID Algorithm
  4.1 Input and Output
  4.2 Initial Imputation
  4.3 Bicluster Extraction with PLAID
  4.4 Block-wise Imputation
  4.5 Iterative Refinement
5 Simulation Studies
  5.1 Simulated Dataset Configurations
  5.2 Performance Comparisons
6 Real Data Examples
  6.1 GSE Datasets
  6.2 UCI Datasets
  6.3 Datasets from R packages
7 Conclusion and Discussion
8 Table
9 Figures
Reference
dc.format.extent 5405358 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0112354029
dc.subject (Keywords) Missing data imputation
dc.subject (Keywords) Biclustering
dc.subject (Keywords) PLAID algorithm
dc.title (Title) A Biclustering Approach to Missing-Value Imputation Based on the PLAID Algorithm (基於 Plaid 演算法的雙向分群缺失值插補方法)
dc.type (Type) thesis