Feature selection for high-dimensional imbalanced microarray data (高維不平衡基因資料的變數選取)


Title	高維不平衡基因資料的變數選取 (Feature selection for high-dimensional imbalanced microarray data)
Author	董承 (Tung, Chen)
Contributors	周珮婷 (Chou, Pei-Ting); 董承 (Tung, Chen)
Keywords	Imbalanced data; High-dimensional data; Microarray data; Biclustering algorithm; Feature selection
Date	2019
Upload time	7-Aug-2019 16:01:51 (UTC+8)
Abstract	Imbalanced data are common in many fields, such as novelty detection, risk management, and medical diagnosis, and the minority class is usually the main target of study. Microarray data are obtained by using biochips to measure gene expression and quantify it for analysis; such data are characterized by a small sample size and very high dimensionality. To address both problems, this study selects features for high-dimensional imbalanced microarray data using the concept of biclustering, and compares the result with the F-test method, Cho's method, and using all variables. The proposed method performs similarly to the F-test method and better than both Cho's method and using all variables.
References
Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. Paper presented at the European Conference on Machine Learning.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34-37.
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Chen, J.-X., Cheng, T.-H., Chan, A. L., & Wang, H.-Y. (2004). An application of classification analysis for skewed class distribution in therapeutic drug monitoring: the case of vancomycin. Paper presented at the 2004 IDEAS Workshop on Medical Information Systems: The Digital Hospital (IDEAS-DH'04).
Cho, J.-H., Lee, D., Park, J. H., & Lee, I.-B. (2003). New gene selection method for classification of cancer subtypes considering within-class variation. FEBS Letters, 551(1-3), 3-7.
Cohen, G., Hilario, M., Sax, H., & Hugonnet, S. (2003). Data imbalance in surveillance of nosocomial infections. Paper presented at the International Symposium on Medical Data Analysis.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Das, S. (2001). Filters, wrappers and a boosting-based hybrid for feature selection. Paper presented at the ICML.
Del Castillo, M. D., & Serrano, J. I. (2004). A multistrategy approach for digital text categorization from imbalanced documents. ACM SIGKDD Explorations Newsletter, 6(1), 70-79.
Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02), 185-205.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77-87.
Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., & Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science, 251(4995), 767-773.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Paper presented at the ICML.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., . . . Caligiuri, M. A. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., . . . Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963-4967.
Gravier, E., Pierron, G., Vincent-Salomon, A., Gruel, N., Raynal, V., Savignoni, A., . . . (2010). A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer, 49(12), 1125-1134.
Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337), 123-129.
He, H., & Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, (9), 1263-1284.
Hira, Z. M., & Gillies, D. F. (2015). A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics, 2015.
Hong, X., Chen, S., & Harris, C. J. (2007). A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on Neural Networks, 18(1), 28-41.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429-449.
Japkowicz, N. (2001). Supervised versus unsupervised binary-learning by feedforward neural networks. Machine Learning, 42(1-2), 97-122.
Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1), 25-36.
Kubat, M., Holte, R. C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2-3), 195-215.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Liu, X.-Y., & Zhou, Z.-H. (2006). The influence of class imbalance on cost-sensitive learning: An empirical study. Paper presented at the Sixth International Conference on Data Mining (ICDM'06).
Lusa, L. (2010). Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics, 11(1), 523.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. Paper presented at the Proceedings of Workshop on Learning from Imbalanced Datasets.
McCarthy, K., Zabar, B., & Weiss, G. (2005). Does cost-sensitive learning beat sampling for classifying rare classes? Paper presented at the Proceedings of the 1st International Workshop on Utility-Based Data Mining.
Pérez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martín, J. I. (2005). Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance. Paper presented at the International Conference on Pattern Recognition and Image Analysis.
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1), 50-59.
Pudil, P., Novovičová, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11), 1119-1125.
Radivojac, P., Korad, U., Sivalingam, K. M., & Obradovic, Z. (2003). Learning from class-imbalanced data in wireless sensor networks. Paper presented at the 2003 IEEE 58th Vehicular Technology Conference, VTC 2003-Fall (IEEE Cat. No. 03CH37484).
Ramey, J. (2016). Datamicroarray: collection of data sets for classification. URL https://github.com/boost-R/datamicroarray.
Raskutti, B., & Kowalczyk, A. (2004). Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter, 6(1), 60-69.
Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., . . . Pinkus, G. S. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., . . . Richie, J. P. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203-209.
Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358-3378.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., . . . Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, 98(20), 11462-11467.
Yang, K., Cai, Z., Li, J., & Lin, G. (2006). A stable gene selection in microarray data analysis. BMC Bioinformatics, 7(1), 228.
Yoon, K., & Kwek, S. (2005). An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. Paper presented at the Fifth International Conference on Hybrid Intelligent Systems (HIS'05).
Yuan, J., Li, J., & Zhang, B. (2006). Learning concepts from large scale imbalanced data sets using support cluster machines. Paper presented at the Proceedings of the 14th ACM International Conference on Multimedia.
Zheng, Z., Wu, X., & Srihari, R. (2004). Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1), 80-89.
Zhou, Z.-H., & Liu, X.-Y. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, (1), 63-77.
Zou, K. H., O'Malley, A. J., & Mauri, L. (2007). Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation, 115(5), 654-657.
Description	Master's thesis; National Chengchi University (國立政治大學); Department of Statistics; 106354014
Source	http://thesis.lib.nccu.edu.tw/record/#G0106354014
Type	thesis

dc.contributor.advisor	周珮婷	zh_TW
dc.contributor.advisor	CHOU, PEI-TING	en_US
dc.contributor.author (Authors)	董承	zh_TW
dc.contributor.author (Authors)	Tung, Chen	en_US
dc.creator (Author)	董承	zh_TW
dc.creator (Author)	Tung, Chen	en_US
dc.date (Date)	2019	en_US
dc.date.accessioned	7-Aug-2019 16:01:51 (UTC+8)	-
dc.date.available	7-Aug-2019 16:01:51 (UTC+8)	-
dc.date.issued (Upload time)	7-Aug-2019 16:01:51 (UTC+8)	-
dc.identifier (Other Identifiers)	G0106354014	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/124685	-
dc.description (Description)	碩士 (Master's)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	106354014	zh_TW
dc.description.abstract (Abstract)	Imbalanced data are common in many fields, such as novelty detection, risk management, and medical diagnosis, and the minority class is usually the main target of study. Microarray data are obtained by using biochips to measure gene expression and quantify it for analysis; such data are characterized by a small sample size and very high dimensionality. To address both problems, this study selects features for high-dimensional imbalanced microarray data using the concept of biclustering, and compares the result with the F-test method, Cho's method, and using all variables. The proposed method performs similarly to the F-test method and better than both Cho's method and using all variables.	en_US
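dc.description.tableofcontents	Chapter 1 Introduction 1; Chapter 2 Literature Review 3; Section 1 Classification of Imbalanced Data 3; Section 2 Introduction to Microarrays 5; Section 3 Feature Selection for Microarray Data 6; Section 4 Biclustering Methods 7; Chapter 3 Research Methods and Procedure 9; Section 1 Algorithms Used 9; Section 2 Classification Evaluation Metrics 11; Section 3 Research Method 14; Section 4 F-test & Cho's Method 15; Chapter 4 Results and Analysis 17; Section 1 Microarray Data 17; Section 2 Simulated Data 18; Section 3 Results on Microarray Data 19; Section 4 Results on Simulated Data 26; Chapter 5 Conclusions and Suggestions 31; Section 1 Conclusions 31; Section 2 Future Research Directions and Suggestions 32; References 33; Appendix 37	zh_TW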
dc.format.extent	1944743 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0106354014	en_US
dc.subject (Keywords)	Imbalanced data	en_US
dc.subject (Keywords)	High-dimensional data	en_US
dc.subject (Keywords)	Microarray data	en_US
dc.subject (Keywords)	Biclustering algorithm	en_US
dc.subject (Keywords)	Feature selection	en_US
dc.title (Title)	高維不平衡基因資料的變數選取	zh_TW
dc.title (Title)	Feature selection for high-dimensional imbalanced microarray data	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU201900460	en_US
