Publications-NSC Projects

Title Feature Selection in Genomic Data (II) (基因體資料中的特徵選取(II))
Alternative title A Statistical Procedure of Feature Selection in a Genomic Study
Author 薛慧敏
Contributors Department of Statistics, National Chengchi University
National Science Council, Executive Yuan
Keywords Biotechnology; Joint confidence region; Optimal sample fraction of case to control; Retrospective logistic regression model; Sample size determination; Two-stage sequential procedure
Date 2008
Upload time 30-Aug-2012 09:59:05 (UTC+8)
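Abstract (translated from the Chinese) Academic year 96 (2007): Testing the significance of a single variable in linear discriminant analysis. In binary classification, the AUC (area under the receiver operating characteristic curve) is commonly used to assess the discriminating power of a classification rule. When the data contain a large number of variables, as in gene expression array data or protein mass spectrometry data, variable selection is a necessary and important task. Here we assume that the data contain two variables and consider linear discriminant functions. Under the normality assumption, a derivation shows that the gain in AUC from adding the second variable is closely related to the ratio of the two variables' effective sizes and to the correlation coefficient between the variables. This year's research will propose statistical tests for the significance of this gain. The first method is a parametric test under the normal distribution assumption; the second is based on a nonparametric estimator of the AUC and uses a resampling method to determine the corresponding critical values. We will use computer simulation to study the power of these tests. In the future, we expect to apply these tests to data with multiple variables in order to carry out variable selection. Academic year 97 (2008): Feature selection in linear discriminant analysis: application to genomic data. In genomic experiments, such as those producing gene expression array data or protein mass spectrometry data, observations on a large number of feature variables are obtained simultaneously. Detecting features with significantly different expression levels, and then developing a classification rule from them, are the two main objectives of such experiments. This year's project will develop a feature selection strategy that uses the AUC as its criterion. Using the tests developed in the previous year's project, we will test the ranked features sequentially and select those that significantly increase the AUC for the subsequent classification rule. The strategy will be applied to a real protein mass spectrometry data set, which we will use to compare the performance of different feature selection methods.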
YEAR 2007. Testing the significance of a variate in a linear discriminant function. In developing a binary classification rule, the AUC (area under the receiver operating characteristic curve) is a commonly used criterion for assessing discriminating power. When the data set includes numerous candidate variates, as in gene expression arrays or protein mass spectrometry, variable selection is an essential step. For simplicity, we assume in this project that two variates, y1 and y2, are available in the data set and consider their linear discriminant functions, c1y1+c2y2. Under normality, the increment in AUC from adding the second variate is shown to depend on the effective sizes of the two variates and the correlation between them. To test the significance of including the second variate, we will propose a parametric statistical procedure under normality. We will also develop an alternative procedure, which is based on a nonparametric estimate of the AUC and a resampling method for determining the critical values. The performance of the tests will be investigated through an empirical study.
YEAR 2008. Feature selection in linear discriminant analysis: application to genomic data. In a genomic experiment, such as one using gene expression arrays or protein mass spectrometry, a large number of features are assayed simultaneously. The two main objectives are to identify differentially expressed features and to develop a classification rule, based on the selected features, for further prediction. This project aims to develop a feature selection strategy based on the AUC criterion. The features are ordered and then tested sequentially using the parametric testing procedure developed in the 2007 project. Features that yield a significant increment in AUC are selected for classification. The strategy will be applied to a real protein mass spectrometry data set and compared with other existing methods.
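The dependence of the AUC increment on the two effective sizes and the correlation can be illustrated with a minimal Python sketch. It is not the project's procedure. It assumes a textbook setting that the abstract does not spell out: both classes are bivariate normal with a common covariance matrix, the "effective size" of a variate is its standardized mean difference (d1, d2), and rho is the within-class correlation. Under those assumptions the best linear discriminant has AUC Φ(Δ/√2) with Δ² = (d1² − 2·rho·d1·d2 + d2²)/(1 − rho²). The function names are hypothetical, and the empirical estimator shown is the usual Mann-Whitney form of a nonparametric AUC estimate; the resampling step for critical values is not reproduced here.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binormal_auc(d1, d2=None, rho=0.0):
    """AUC of the best linear discriminant under the ASSUMED model:
    two normal classes with equal covariance, standardized mean
    differences d1 (and optionally d2), within-class correlation rho."""
    if d2 is None:
        delta_sq = d1 ** 2  # single variate: squared standardized difference
    else:
        # squared Mahalanobis distance between the two class means
        delta_sq = (d1 ** 2 - 2.0 * rho * d1 * d2 + d2 ** 2) / (1.0 - rho ** 2)
    return normal_cdf(math.sqrt(delta_sq / 2.0))


def auc_gain(d1, d2, rho):
    """Increment in AUC from adding the second variate to the first."""
    return binormal_auc(d1, d2, rho) - binormal_auc(d1)


def empirical_auc(case_scores, control_scores):
    """Mann-Whitney estimate: share of (case, control) pairs in which the
    case score is larger, counting ties as one half."""
    wins = 0.0
    for x in case_scores:
        for y in control_scores:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

In this sketch the gain vanishes when d2 = rho·d1, for example `auc_gain(1.0, 0.5, 0.5)`, which is one way to see why the ratio of effective sizes and the correlation jointly determine whether a second variate helps.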
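Relation Basic research
Academic grant
Research period: 9708~9807 (ROC calendar; August 2008 to July 2009)
Research funding: 347 thousand NT dollars
Type report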
dc.contributor Department of Statistics, National Chengchi University (en_US)
dc.contributor National Science Council, Executive Yuan (en_US)
dc.creator (Author) 薛慧敏 (zh_TW)
dc.date (Date) 2008 (en_US)
dc.date.accessioned 30-Aug-2012 09:59:05 (UTC+8)
dc.date.available 30-Aug-2012 09:59:05 (UTC+8)
dc.date.issued (Upload time) 30-Aug-2012 09:59:05 (UTC+8)
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/53394
dc.language.iso en_US
dc.relation (Relation) Basic research (en_US)
dc.relation (Relation) Academic grant (en_US)
dc.relation (Relation) Research period: 9708~9807 (ROC calendar; August 2008 to July 2009) (en_US)
dc.relation (Relation) Research funding: 347 thousand NT dollars (en_US)
dc.subject (Keywords) Biotechnology; Joint confidence region; Optimal sample fraction of case to control; Retrospective logistic regression model; Sample size determination; Two-stage sequential procedure (en_US)
dc.title (Title) Feature Selection in Genomic Data (II) (基因體資料中的特徵選取(II)) (zh_TW)
dc.title.alternative (Alternative title) A Statistical Procedure of Feature Selection in a Genomic Study (en_US)
dc.type (Type) report (en)