Title 維度縮減應用於蛋白質質譜儀資料
Dimension Reduction on Protein Mass Spectrometry Data
Author 黃靜文 (Huang, Ching-Wen)
Contributors 余清祥 (Yue, Ching-Syang)
黃靜文 (Huang, Ching-Wen)
Keywords Classification
Dimension reduction
Disease diagnosis
Computer simulation
Date 2004
Upload date 2009-09-14
Abstract This study uses a prostate cancer protein data set consisting of serum protein intensity data acquired by Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS), with the aim of determining whether a subject has cancer. The subjects fall into four classes: normal, benign tumor, early-stage cancer, and late-stage cancer. The data set comes in two versions: raw data with about 48,000 intervals (variables), and manually processed data in which manual variable selection leaves only 779 intervals (variables). Both are high-dimensional, and each has about 650 observations. With so many variables, the data are hard to analyze and slow to compute on. The goal of this study is therefore to find, under effective dimension reduction, the approach that minimizes the misclassification rate.
     We first compare three classification methods: support vector machine, artificial neural network, and classification and regression tree. The two better performers, support vector machine and artificial neural network, are then applied to classify the dimension-reduced data. The dimension reduction methods considered are discrete wavelet transform, principal component analysis, and principal component analysis networks. According to the results, discrete wavelet transform and principal component analysis perform better, while principal component analysis networks are only barely satisfactory.
     Beyond evaluating these dimension reduction methods on this data set, we propose an overlap method that combines a linear dimension reduction method (principal component analysis) with a nonlinear one (principal component analysis networks), in the hope of lowering the screening misclassification rate below what a single dimension reduction method achieves. The overlap method yields no clear improvement on the raw data but a clear improvement on the manually processed (preprocessed) data.
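As a rough illustration of how a discrete wavelet transform can shrink a long intensity vector, the sketch below applies repeated Haar steps and keeps only the approximation coefficients. The Haar wavelet, the function names `haar_step` and `haar_reduce`, and the toy numbers are assumptions made for this example; the record does not state which wavelet family or decomposition depth the thesis used.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail), each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Scaled pairwise sums capture the coarse shape of the signal.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # Scaled pairwise differences capture the fine-scale variation.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reduce(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps.

    The output length is len(signal) / 2**levels.
    """
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx


# Toy "spectrum" of 8 intensity values (made-up numbers, not thesis data).
spectrum = [1.0, 1.0, 2.0, 2.0, 8.0, 8.0, 3.0, 3.0]
reduced = haar_reduce(spectrum, levels=2)
print(len(spectrum), "->", len(reduced), reduced)
```

Each level halves the number of variables, so the input length has to be divisible by `2**levels`; a real spectrum would typically be padded or trimmed first. The reduced vectors could then be passed to any classifier in place of the original intensities.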
Description Master's thesis
National Chengchi University
Graduate Institute of Statistics
92354012
93
Source http://thesis.lib.nccu.edu.tw/record/#G0923540121
Type thesis
dc.contributor.advisor 余清祥 (Yue, Ching-Syang)
dc.date.accessioned 2009-09-14
dc.date.available 2009-09-14
dc.identifier (Other Identifiers) G0923540121
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/30944
dc.description.tableofcontents Chapter 1 Introduction
     Section 1 Research Motivation and Objectives
     Section 2 Data Source and Overview
     1.2.1. Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry
     1.2.2. Data Overview
     Section 3 Research Tools and Settings
     
     Chapter 2 Classification Methods
     Section 1 Support Vector Machine
     2.1.1. Method Overview
     2.1.2. Parameter Settings
     Section 2 Artificial Neural Network
     2.2.1. Method Overview
     2.2.2. Parameter Settings
     Section 3 Classification and Regression Tree
     Section 4 Empirical Results
     
     Chapter 3 Dimension Reduction Methods
     Section 1 Discrete Wavelet Transform
     3.1.1. Method Overview
     3.1.2. Parameter Settings
     3.1.3. Selecting the Number of Wavelet Coefficients
     Section 2 Principal Component Analysis
     3.2.1. Method Overview
     3.2.2. Selecting the Number of Principal Components
     3.2.3. Effect of Principal Component Analysis
     Section 3 Principal Component Analysis Networks
     3.3.1. Method Overview
     3.3.2. Parameter Settings
     3.3.3. Selecting the Number of Hidden-Layer Nodes
     Section 4 Comparison of Methods
     
     Chapter 4 Overlap Method
     Section 1 Method Overview
     Section 2 Empirical Results
     
     Chapter 5 Conclusions and Suggestions
     Section 1 Conclusions
     Section 2 Suggestions
     
     References
     
     Appendix 1: Mean Misclassification Rates and Standard Deviations of the Classification Methods
     Appendix 2: Mean Misclassification Rates and Standard Deviations for Principal Component Analysis
     Appendix 3: Mean Misclassification Rates and Standard Deviations for Principal Component Analysis Networks
     Appendix 4: Mean Misclassification Rates and Standard Deviations for the Overlap Method
     Appendix 5: Histograms of Neural Network Classification Output Values after Dimension Reduction
     Appendix 6: Misclassification Proportions by Output-Value Interval after Dimension Reduction
dc.language.iso en_US