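Title 運用隨機森林分類方法在檢定基因集的顯著性 (Applying the Random Forest Classification Method to Testing the Significance of Gene Sets)
Alternative title Using Random Forest in Testing the Significance of a Gene-Set
Authors 薛慧敏; 蔡政安
Contributors Department of Statistics, National Chengchi University
National Science Council, Executive Yuan
Keywords Statistics; random forest classification; gene-set testing
Date 2011
Upload time 30-Aug-2012 09:59:20 (UTC+8)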
Abstract In DNA microarray studies, a growing number of researchers have shifted from testing the association between individual genes and a phenotype to testing the significance of a predefined gene set. Gene-set analysis (GSA) evaluates the association between a phenotype and the expression of biological pathways or a priori defined gene sets. Genes are grouped by biological function, and several public databases provide such gene-set annotations. Two kinds of significance test are of interest. The competitive test asks whether a specific gene set is more differentially expressed than other gene sets. The self-contained test asks whether the gene set by itself is differentially expressed. Because the two tests have different null hypotheses, their null distributions also differ. To account for interaction and correlation among the genes within a set, this study assesses the significance of a gene set through the performance of a classifier built on that set: a Random Forest classifier is trained on the genes in the set, and its prediction error rate measures the association between the set and the phenotype. For each of the two tests, an empirical P-value for the observed out-of-bag (OOB) error rate is obtained with a resampling method suited to that test's null hypothesis. The method will be applied to several real data sets and compared with other methods, and a simulation study will be conducted to verify its validity.
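The idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' procedure: the abstract does not specify the resampling schemes, so the sketch permutes phenotype labels for the self-contained test and draws random gene sets of the same size for the competitive test. The classifier is a deliberately tiny stand-in for a Random Forest (bagged one-gene threshold rules with a random gene subset per tree), the P-value uses the common add-one convention, and all function names and the toy data are hypothetical.

```python
import random


def fit_stump(X, y, rows, feats):
    """Pick the single-gene threshold rule with the fewest errors on `rows`."""
    best = None
    for j in feats:
        for t in sorted({X[i][j] for i in rows}):
            for hi in (0, 1):  # class predicted when expression > t
                err = sum((hi if X[i][j] > t else 1 - hi) != y[i] for i in rows)
                if best is None or err < best[0]:
                    best = (err, j, t, hi)
    return best[1:]


def oob_error(X, y, genes, n_trees=25, rng=None):
    """OOB error of bagged stumps built only on the columns listed in `genes`."""
    rng = rng or random.Random(0)
    n = len(y)
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feats = rng.sample(genes, max(1, int(len(genes) ** 0.5)))
        j, t, hi = fit_stump(X, y, rows, feats)
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag samples get a vote
                votes[i][hi if X[i][j] > t else 1 - hi] += 1
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    wrong = sum((votes[i][1] > votes[i][0]) != (y[i] == 1) for i in scored)
    return wrong / len(scored) if scored else 1.0


def self_contained_p(X, y, genes, n_perm=50, seed=1):
    """Assumed null: phenotype labels are exchangeable (label permutation)."""
    rng = random.Random(seed)
    observed = oob_error(X, y, genes, rng=rng)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        hits += oob_error(X, y_perm, genes, rng=rng) <= observed
    return (hits + 1) / (n_perm + 1)


def competitive_p(X, y, genes, n_sets=50, seed=1):
    """Assumed null: the set is no better than a random set of the same size."""
    rng = random.Random(seed)
    observed = oob_error(X, y, genes, rng=rng)
    all_genes = list(range(len(X[0])))
    hits = 0
    for _ in range(n_sets):
        random_set = rng.sample(all_genes, len(genes))
        hits += oob_error(X, y, random_set, rng=rng) <= observed
    return (hits + 1) / (n_sets + 1)


def toy_data(seed=42, n=24, p=12, signal=(0, 1, 2), shift=3.0):
    """Simulated expression matrix: genes in `signal` are shifted in class 1."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * y[i] if j in signal else 0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = toy_data()
    gene_set = [0, 1, 2]
    print("OOB error:", oob_error(X, y, gene_set))
    print("self-contained P:", self_contained_p(X, y, gene_set))
    print("competitive P:", competitive_p(X, y, gene_set))
```

The two functions differ only in what is resampled, which is how the sketch reflects the abstract's point that the two tests have different null distributions. A real analysis would replace `oob_error` with the OOB error of a full Random Forest implementation and use far more resamples.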
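Relation Applied research
Academic grant
Research period: 10008~10107 (ROC calendar: August 2011 to July 2012)
Research funding: NT$441 thousand
Type report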
URI http://nccur.lib.nccu.edu.tw/handle/140.119/53403
Language en_US