Title The supervised approach for converting tabular data into images for CNN-based deep learning prediction
Author Tu, Yu-Shan (凃于珊)
Advisor 吳漢銘
Keywords Within and between analysis
Supervised distance matrix
Convolutional neural network
Date 2023
Uploaded 2-Aug-2023 13:03:47 (UTC+8)
Abstract
When dealing with classification prediction problems of tabular data, traditional machine learning methods such as decision trees, random forests, and support vector machines usually require data feature extraction and preprocessing. However, recent research has proposed converting tabular data into images, and then using convolutional neural network models to train and predict the converted data images. This method not only eliminates the aforementioned preprocessing steps but also achieves better prediction results. Among these methods, the Image Generator for Tabular Data (IGTD) minimizes the difference between the feature distance matrix and the target image pixel position distance matrix, assigning each feature (variable) in the tabular data to a unique pixel position in the image, thereby generating an image for each sample. In these images, the pixel intensity reflects the value of the corresponding feature (variable) in the sample. The IGTD method does not require domain knowledge of the data and can provide a better feature neighborhood structure. Based on the characteristics of IGTD, this study introduces the concept of supervised distance calculation and incorporates data category information during the image generation process to improve the accuracy of image classification prediction. First, we use the Within and Between Analysis (WABA) based on data category information to calculate different correlation coefficients and their corresponding distances between features. Then, we use the images generated by these different correlation coefficients for data augmentation to increase the number of samples and solve the problem of the number of data samples being far less than the number of features. In addition, we also consider different conversion formulas for converting correlation coefficients into distances to understand their impact on data image generation and the results of the convolutional neural network model. 
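As a rough sketch of the IGTD idea described above — assigning each feature to a unique pixel by matching the feature distance matrix to the pixel-position distance matrix — the following toy random-swap search may help. The function name and the naive search loop are my own assumptions for illustration; the published IGTD algorithm (Zhu et al., 2021) uses a more refined iterative swapping scheme.

```python
import numpy as np

def igtd_assign(D_feat, n_rows, n_cols, n_iter=2000, seed=0):
    """Toy IGTD-style search: permute features so that the feature distance
    matrix resembles the Euclidean distance matrix of pixel positions.
    Returns perm, where pixel k (row-major order) holds feature perm[k]."""
    n = n_rows * n_cols
    assert D_feat.shape == (n, n), "need one feature per pixel"
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between pixel positions on the image grid
    rr, cc = np.divmod(np.arange(n), n_cols)
    D_pix = np.sqrt((rr[:, None] - rr[None, :]) ** 2
                    + (cc[:, None] - cc[None, :]) ** 2)
    perm = np.arange(n)

    def error(p):
        # Squared difference between reordered feature distances and pixel distances
        return float(np.sum((D_feat[np.ix_(p, p)] - D_pix) ** 2))

    best = error(perm)
    for _ in range(n_iter):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        perm[i], perm[j] = perm[j], perm[i]      # try swapping two assignments
        e = error(perm)
        if e < best:
            best = e                             # keep the improving swap
        else:
            perm[i], perm[j] = perm[j], perm[i]  # revert
    return perm
```

Given such a layout, each sample's image places the sample's value of feature `perm[k]` at pixel `k`, so correlated features end up in neighboring pixels.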
We applied the proposed method to multiple real gene expression datasets. The results show that the new method outperforms IGTD: it significantly improves the prediction accuracy of the convolutional neural network model and broadens the applicability of convolutional neural networks to tabular data.
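The supervised distance construction can be sketched as follows. This is a minimal reading of a WABA-style correlation decomposition (Dansereau et al., 1984) together with two common correlation-to-distance conversions, 1 − r and √(2(1 − r)); the function names are mine, and the thesis's exact formulas may differ.

```python
import numpy as np

def waba_correlations(X, labels):
    """Split feature-feature correlations into within-group and between-group
    parts, WABA-style: each sample is decomposed into (group mean - grand mean)
    plus (sample - group mean), and correlations are computed on each part.
    X: (n_samples, n_features); labels: (n_samples,) class labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand = X.mean(axis=0)
    within = np.empty_like(X)
    between = np.empty_like(X)
    for g in np.unique(labels):
        idx = labels == g
        gmean = X[idx].mean(axis=0)
        between[idx] = gmean - grand   # class-level signal
        within[idx] = X[idx] - gmean   # residual, class effect removed

    def corr(M):
        s = M.std(axis=0)
        s = np.where(s == 0, 1.0, s)   # guard constant features
        Z = (M - M.mean(axis=0)) / s
        return Z.T @ Z / M.shape[0]

    return corr(within), corr(between)

def corr_to_distance(R, kind="one_minus"):
    """Two common ways to turn a correlation matrix into a distance matrix."""
    if kind == "one_minus":
        return 1.0 - R                                        # range [0, 2]
    if kind == "sqrt":
        return np.sqrt(np.clip(2.0 * (1.0 - R), 0.0, None))   # Euclidean-like
    raise ValueError(kind)
```

Feeding, say, the between-group distance matrix into an IGTD-style layout emphasizes class-discriminative structure, and using several correlation components and conversion formulas yields several images per sample — the augmentation the abstract describes.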
References Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503–511.

Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R., & Korsmeyer, S. J. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1), 41–47.

Bazgir, O., Zhang, R., Dhruba, S. R., Rahman, R., Ghosh, S., & Pal, R. (2020). Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nature Communications, 11(1), 4391.

Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, Second Edition (pp. 437–478).

Bertucci, F., Salas, S., Eysteries, S., Nasser, V., Finetti, P., Ginestier, C., Charafe-Jauffret, E., Loriod, B., Bachelart, L., Montfort, J., et al. (2004). Gene expression profiling of colon cancer by DNA microarrays and correlation with histoclinical parameters. Oncogene, 23(7), 1377–1391.

Chollet, F. (2021). Deep learning with Python. Simon and Schuster.

Ciregan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3642–3649). IEEE.

Dansereau, F., Alutto, J. A., & Yammarino, F. J. (1984). Theory testing in organizational behavior: The varient approach. Prentice Hall.

Díaz-Uriarte, R. (2005). Supervised methods with genomic data: a review and cautionary view. In Data analysis and visualization in genomics and proteomics (pp. 193–214).

Gu, Q., Li, Z., & Han, J. (2012). Generalized fisher score for feature selection. arXiv preprint arXiv:1202.3725.

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., & Cuadros, J. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2402–2410.

Hart, P. E., Stork, D. G., & Duda, R. O. (2000). Pattern classification. Wiley Hoboken.

Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8), 1509–1515.

Jirapech-Umpai, T. & Aitken, S. (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC bioinformatics, 6(1), 1–11.

Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., & Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis, 36, 61–78.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6), 673–679.

Kim, K., Zhang, S., Jiang, K., Cai, L., Lee, I.-B., Feldman, L. J., & Huang, H. (2007). Measuring similarities between gene expression profiles through new data transformations. BMC bioinformatics, 8, 1–14.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4), 869–885.

Li, Y., Campbell, C., & Tipping, M. (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18(10), 1332–1339.

Ma, S. & Zhang, Z. (2018). OmicsMapNet: Transforming omics data to take advantage of deep convolutional neural network for discovery. arXiv preprint arXiv:1804.05283.

Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (pp. 2642–2651). PMLR.

Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.

Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A., & Tsunoda, T. (2019). DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific Reports, 9(1), 11399.

Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2), 203–209.

Wainberg, M., Merico, D., Delong, A., & Frey, B. J. (2018). Deep learning in biomedicine. Nature biotechnology, 36(9), 829–838.

Wu, H.-M., Tien, Y.-J., Ho, M.-R., Hwu, H.-G., Lin, W.-c., Tao, M.-H., & Chen, C.-h. (2018). Covariate-adjusted heatmaps for visualizing biological data via correlation decomposition. Bioinformatics, 34(20), 3529–3538.

Yeung, K. Y., Bumgarner, R. E., & Raftery, A. E. (2005). Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21(10), 2394–2402.

Zhu, Y., Brettin, T., Xia, F., Partin, A., Shukla, M., Yoo, H., Evrard, Y. A., Doroshow, J. H., & Stevens, R. L. (2021). Converting tabular data into images for deep learning with convolutional neural networks. Scientific Reports, 11(1), 11325.
Description Master's thesis
National Chengchi University
Department of Statistics
110354011
Source http://thesis.lib.nccu.edu.tw/record/#G0110354011
Type thesis
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/146305
dc.description.tableofcontents
1 Introduction 1
2 Variable selection and classification methods for high-dimensional data 4
2.1 Variable selection via the Fisher criterion 4
2.2 Classification methods 7
3 Converting tabular data into images for CNN-based deep learning 8
3.1 Image Generator for Tabular Data 8
3.2 CNN model architecture 12
4 A supervised image generator for tabular data incorporating class information 12
4.1 Correlation coefficient measures 12
4.2 Within and between analysis 13
5 Real data analysis 16
5.1 Experimental settings 16
5.2 Results of converting tabular data into images 17
5.3 Comparison of prediction accuracy across classification methods 20
5.4 Examining between-group differences and within-group consistency via heatmaps 22
5.5 Exploring whether significant genes and their pixel arrangement affect CNN feature extraction 27
6 Conclusions and discussion 31
References 33
A Appendix 38
dc.format.extent 14426107 bytes
dc.format.mimetype application/pdf