Title: A Biclustering Approach to Missing-Value Imputation Based on the PLAID Algorithm (基於 Plaid 演算法的雙向分群缺失值插補方法)
Author: Lin, Yung-Sheng (林詠盛)
Advisor: Wu, Han-Ming (吳漢銘)
Keywords: Missing data imputation; Biclustering; PLAID algorithm
Date: 2025
Uploaded: 4-Aug-2025 15:12:22 (UTC+8)
Abstract:
Missing value imputation is a critical step in data analysis, especially in bioinformatics, where datasets frequently contain missing entries that can undermine the validity of results. Current imputation methods, such as multiple imputation and k-nearest neighbors (KNN), have notable limitations. Multiple imputation depends on strong and often untestable stochastic assumptions, while KNN performs poorly on high-dimensional data. To address these challenges, we propose a new imputation framework based on the PLAID biclustering algorithm. PLAID detects overlapping patterns and block structures in the data, capturing the localized co-variation and functional modules commonly found in gene expression and clinical datasets. By using these structures to guide imputation, our method aims for biologically coherent, context-aware handling of missing data. Through simulation studies and real-world data analyses, we compare our approach with existing methods. The results demonstrate that leveraging biclustering structures leads to more accurate and biologically meaningful imputation than conventional techniques.

References:
Aittokallio, T. (2010). Dealing with missing values in large-scale studies: Microarray data imputation and beyond. Briefings in Bioinformatics, 11(2), 253–264.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. https://doi.org/10.1002/mpr.329
Andrews, T. S., & Hemberg, M. (2019). False signals induced by single-cell imputation. F1000Research, 7, 1740. https://doi.org/10.12688/f1000research.16613.2
Bishop, C. M. (1999). Variational principal components. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99 (Conf. Publ. No. 470) (Vol. 1, pp. 509–514). IET.
Jadhav, A., Pramod, D., & Ramanathan, K. (2019). Comparison of performance of data imputation methods for numeric dataset. Applied Artificial Intelligence, 33(10), 913–933.
Jin, L., Bi, Y., Hu, C., Qu, J., Shen, S., Wang, X., & Tian, Y. (2021). A comparative study of evaluating missing value imputation methods in label-free proteomics. Scientific Reports, 11(1), 1760.
Lazzeroni, L., & Owen, A. (2002). Plaid models for gene expression data. Statistica Sinica, 12, 61–86.
Liew, A. W.-C., Law, N.-F., & Yan, H. (2011). Missing value imputation for gene expression data: Computational techniques to recover missing data from available information. Briefings in Bioinformatics, 12(5), 498–513.
Liao, S. G., Lin, Y., Kang, D. D., Chandra, D., Bon, J., Kaminski, N., & Tseng, G. C. (2014). Missing value imputation in high-dimensional phenomic data: Imputable or not, and how? BMC Bioinformatics, 15(1), 1–12.
Oba, S., Sato, M. A., Takemasa, I., Monden, M., Matsubara, K. I., & Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16), 2088–2096.
Stacklies, W., Redestig, H., Scholz, M., Walther, D., & Selbig, J. (2007). pcaMethods: A Bioconductor package providing PCA methods for incomplete data. Bioinformatics, 23(9), 1164–1167. https://doi.org/10.1093/bioinformatics/btm069
Schmitt, P., Mandel, J., & Guedj, M. (2015). A comparison of six methods for missing data imputation. Journal of Biometrics & Biostatistics, 6(1), 1.
Samad, M., Kowsar, I., Rabbani, S., & Hou, Y. (2024). Deepifsac: Deep imputation of missing values using feature and sample attention within contrastive framework. Available at SSRN 5137008.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525. https://doi.org/10.1093/bioinformatics/17.6.520
Turner, H., Bailey, T., & Krzanowski, W. (2005). Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics & Data Analysis, 48(2), 235–254.
Van Buuren, S., & Oudshoorn, K. (1999). Flexible multivariate imputation by MICE (Tech. Rep.). TNO Report, TNO.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
Yang, Y., Xu, Z., & Song, D. (2016). Missing value imputation for microRNA expression data by using a GO-based similarity measure. BMC Bioinformatics, 17(Suppl 17), 109–116. https://doi.org/10.1186/s12859-016-1275-2
Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biology, 18(1), 174.

Description: Master's
National Chengchi University
Department of Statistics
112354029
Source: http://thesis.lib.nccu.edu.tw/record/#G0112354029
Type: thesis
Other identifier: G0112354029
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/158718
Format: application/pdf, 5405358 bytes
Table of contents:
Acknowledgements (Chinese)
Acknowledgements (English)
Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Aims and Objectives
  1.4 Significance of the Study
  1.5 Organization of the Thesis
2 Some Existing Missing Values Imputation Methods
  2.1 Mean Imputation
  2.2 Median Imputation
  2.3 K-Nearest Neighbors (KNN)
  2.4 Singular Value Decomposition (SVD)
  2.5 Bayesian Principal Component Analysis (BPCA)
  2.6 Multiple Imputation by Chained Equations (MICE)
  2.7 Evaluation Metrics
3 The PLAID Model Biclustering Method
  3.1 Biclustering Overview
  3.2 Mathematical Representation and Objective Function
  3.3 Parameter Estimation
  3.4 Sequential Layer Clustering and Stopping Criteria
4 Missing Values Imputation based on PLAID Algorithm
  4.1 Input and Output
  4.2 Initial Imputation
  4.3 Bicluster Extraction with PLAID
  4.4 Block-wise Imputation
  4.5 Iterative Refinement
5 Simulation Studies
  5.1 Simulated Dataset Configurations
  5.2 Performance Comparisons
6 Real Data Examples
  6.1 GSE Datasets
  6.2 UCI Datasets
  6.3 Datasets from R packages
7 Conclusion and Discussion
8 Table
9 Figures
Reference
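
Illustrative sketch (not taken from the thesis): the abstract describes using bicluster structure to guide imputation, and the table of contents for Chapter 4 names four steps: initial imputation, bicluster extraction, block-wise imputation, and iterative refinement. The Python below is one minimal way such a loop could look, under stated assumptions. The bicluster finder is a hypothetical pluggable function (here a toy that returns one known block, standing in for a real PLAID fit), the initial fill is assumed to be the column mean, and each block is modelled with the additive plaid-style form mu + alpha_i + beta_j. All function names are invented for illustration.

```python
import numpy as np

def initial_impute(X):
    """Assumed initial step: replace NaN entries with their column means."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def additive_fit(block):
    """Additive layer fit for one block: mu + alpha_i + beta_j."""
    mu = block.mean()
    alpha = block.mean(axis=1, keepdims=True) - mu  # row effects
    beta = block.mean(axis=0, keepdims=True) - mu   # column effects
    return mu + alpha + beta

def bicluster_impute(X, find_biclusters, n_rounds=20):
    """Refine missing cells that fall inside biclusters.

    find_biclusters is a hypothetical callable that takes the current
    filled matrix and returns a list of (row_indices, col_indices).
    Cells outside every bicluster keep their initial fill.
    """
    missing = np.isnan(X)
    filled = initial_impute(X)
    for _ in range(n_rounds):
        for rows, cols in find_biclusters(filled):
            sub = np.ix_(rows, cols)
            block = filled[sub]            # copy of the block
            fitted = additive_fit(block)
            block_missing = missing[sub]
            # Overwrite only the originally missing cells.
            block[block_missing] = fitted[block_missing]
            filled[sub] = block
    return filled

def toy_finder(filled):
    """Toy stand-in for bicluster extraction: one fixed 4x4 block."""
    return [(np.arange(4), np.arange(4))]

# Small demonstration: a 6x6 noise matrix with an exactly additive 4x4 block.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(6, 6))
row_eff = np.array([0.0, 1.0, 2.0, 3.0])
col_eff = np.array([10.0, 20.0, 30.0, 40.0])
X_full[:4, :4] = row_eff[:, None] + col_eff[None, :]

X_missing = X_full.copy()
X_missing[1, 2] = np.nan  # hide one cell inside the block (true value 31)

result = bicluster_impute(X_missing, toy_finder)
print("column-mean fill:", initial_impute(X_missing)[1, 2])
print("block-wise fill: ", result[1, 2])
```

In this toy setting the block is exactly additive, so the block-wise fill recovers the hidden value while the column-mean fill is pulled toward the unrelated rows. Real data would need an actual bicluster detector, a stopping rule, and a policy for cells covered by several overlapping biclusters, none of which this sketch addresses.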
