Application of the Random Forests Classification Method to Gene Set Significance Testing (隨機森林分類方法於基因組顯著性檢定上之應用)

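Title	Application of the Random Forests Classification Method to Gene Set Significance Testing (隨機森林分類方法於基因組顯著性檢定上之應用); Assessing the significance of a Gene Set
Author	卓達瑋
Contributors	薛慧敏 (advisor); 蔡政安 (advisor); 卓達瑋 (author)
Keywords	phenotypes; gene set analysis; Random Forests; permutation-based p-value
Date	2010
Upload time	5-Sep-2013 15:12:54 (UTC+8)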
Abstract	Microarray data analysis has become an important topic in biomedical research. One major goal is to use quantitative data from gene expression experiments to study the relationship between genes and specific phenotypes. Most existing methods are single-gene analyses, which use only the information in individual genes and cannot adequately account for correlation among genes. This thesis addresses gene set analysis and proposes a statistical test for the significance of a given gene set with respect to a phenotype. To capture as much as possible of the relationship between the whole gene set and the phenotype, we combine a conventional testing framework with a classification method: the test error rate of a Random Forests classifier serves as the test statistic, and the statistical conclusion is drawn from its permutation-based p-value. In simulation studies we compare the proposed test with seven other gene set analysis methods and find that it compares favorably, with a controlled type I error rate and high power. Finally, we apply the method to several real gene expression data sets and discuss the results.
Description	Master's; National Chengchi University; Graduate Institute of Statistics; 98354014; 99
Source	http://thesis.lib.nccu.edu.tw/record/#G0098354014
Type	thesis
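
The test outlined in the abstract (classification test error as the test statistic, judged by a permutation-based p-value) can be illustrated with a minimal sketch. This is not the thesis's implementation. Everything below is an assumption made for illustration: a small bagged ensemble of one-split trees with random feature subsets stands in for a full Random Forests classifier so the example needs only the Python standard library, the train/test split is fixed, labels are permuted across all samples, and the p-value is computed as (1 + number of permuted errors no larger than the observed error) / (1 + number of permutations). All function names and parameter values are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """Return the one-split rule (feature, threshold, flipped) with fewest training errors."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X})
        # Candidate cut points: "no split" plus midpoints between observed values.
        cuts = [float("-inf")] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in cuts:
            # Errors when predicting class 1 for x[j] > t; flipping the rule swaps classes.
            errors = sum((row[j] > t) != (label == 1) for row, label in zip(X, y))
            for flipped, e in ((False, errors), (True, len(y) - errors)):
                if best is None or e < best[0]:
                    best = (e, j, t, flipped)
    return best[1:]


def predict_stump(stump, row):
    j, t, flipped = stump
    return int((row[j] > t) != flipped)


def forest_test_error(X_train, y_train, X_test, y_test, rng, n_trees=15):
    """Test error of a bagged stump ensemble (a simplified stand-in for Random Forests)."""
    n, p = len(X_train), len(X_train[0])
    m = max(1, int(p ** 0.5))  # number of genes tried per tree
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of the training set
        feats = rng.sample(range(p), m)             # random subset of genes
        stumps.append(fit_stump([X_train[i] for i in idx],
                                [y_train[i] for i in idx], feats))
    wrong = 0
    for row, label in zip(X_test, y_test):
        votes = sum(predict_stump(s, row) for s in stumps)
        wrong += int(int(2 * votes > n_trees) != label)  # majority vote vs. true label
    return wrong / len(y_test)


def permutation_p_value(X, y, n_train, n_perm=100, seed=0):
    """Observed test error and its permutation p-value (small error = evidence of association)."""
    rng = random.Random(seed)
    X_tr, X_te = X[:n_train], X[n_train:]
    observed = forest_test_error(X_tr, y[:n_train], X_te, y[n_train:], rng)
    count = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any gene-phenotype association
        err = forest_test_error(X_tr, y_perm[:n_train], X_te, y_perm[n_train:], rng)
        count += err <= observed
    return observed, (1 + count) / (1 + n_perm)


def simulate_gene_set(n_per_class=20, n_genes=5, shift=1.5, seed=1):
    """Toy data: the first two genes are shifted in phenotype 1; samples are shuffled."""
    rng = random.Random(seed)
    rows = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(shift * label if g < 2 else 0.0, 1.0) for g in range(n_genes)]
            rows.append((x, label))
    rng.shuffle(rows)
    return [x for x, _ in rows], [label for _, label in rows]


if __name__ == "__main__":
    X, y = simulate_gene_set()
    error, p = permutation_p_value(X, y, n_train=30, n_perm=100)
    print(f"test error = {error:.3f}, permutation p-value = {p:.3f}")
```

With this construction the smallest attainable p-value is 1 / (1 + n_perm), so the number of permutations limits the resolution of the test. In practice one would presumably plug an established Random Forests implementation into the same permutation loop in place of the stand-in ensemble.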

dc.contributor.advisor	薛慧敏; 蔡政安	zh_TW
dc.contributor.author (Authors)	卓達瑋	zh_TW
dc.creator (Author)	卓達瑋	zh_TW
dc.date (Date)	2010	en_US
dc.date.accessioned	5-Sep-2013 15:12:54 (UTC+8)	-
dc.date.available	5-Sep-2013 15:12:54 (UTC+8)	-
dc.date.issued (Upload time)	5-Sep-2013 15:12:54 (UTC+8)	-
dc.identifier (Other Identifiers)	G0098354014	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/60442	-
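dc.description (Description)	Master's	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	Graduate Institute of Statistics	zh_TW
dc.description (Description)	98354014	zh_TW
dc.description (Description)	99	zh_TW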
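dc.description.tableofcontents	Chapter 1. Introduction; Chapter 2. Data Description and Significance Hypothesis Testing (2.1 Data Description and Gene Set Significance Testing; 2.2 The Random Forests Classification Method; 2.3 Significance Testing Methods); Chapter 3. Simulation Study; Chapter 4. Empirical Analysis (4.1 Description of the Gene Data; 4.2 Empirical Analysis and Results; 4.3 Convergence Issues); Chapter 5. Conclusions and Suggestions; References; Appendix (Appendix A; Appendix B)	zh_TW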
dc.format.extent	1068589 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0098354014	en_US
dc.subject (Keywords)	外顯表型變數 (phenotypes)	zh_TW
dc.subject (Keywords)	基因組分析 (gene set analysis)	zh_TW
dc.subject (Keywords)	隨機森林分類方法 (Random Forests classification method)	zh_TW
dc.subject (Keywords)	排列顯著值 (permutation-based p-value)	zh_TW
dc.subject (Keywords)	phenotypes	en_US
dc.subject (Keywords)	gene set analysis	en_US
dc.subject (Keywords)	Random Forests	en_US
dc.subject (Keywords)	permutation-based p-value	en_US
dc.title (Title)	隨機森林分類方法於基因組顯著性檢定上之應用 (Application of the Random Forests Classification Method to Gene Set Significance Testing)	zh_TW
dc.title (Title)	Assessing the significance of a Gene Set	en_US
dc.type (Type)	thesis	en
