Homogeneity and Stability of Hierarchical Clustering (階層式分群方法的同質性與穩固性)

Title	Homogeneity and Stability of Hierarchical Clustering (階層式分群方法的同質性與穩固性)
Author	林韋志 Lin, Wei-Chih
Contributors	周珮婷 Chou, Pei-Ting (advisor); 林韋志 Lin, Wei-Chih (author)
Keywords	Unsupervised Machine Learning; Hierarchical Clustering; Cluster Validation (非監督機器學習; 階層式分群; 分群驗證)
Date	2021
Upload time	1-Jul-2021 17:34:21 (UTC+8)
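Abstract (translated from Chinese)	Today, the mainstream way to validate clustering results is to compute various cluster validation indexes, but these indexes do not necessarily give reasonable answers for data with many categorical variables. This study therefore uses hierarchical clustering to build a clustering tree on the target variable and a second clustering tree, based on Euclidean distance, on the other variable. A heatmap is then drawn from the two trees, and its color blocks are used to identify groups that are more closely related in the data geometry. Next, the original data are simulated using the idea behind ANOVA, and the cluster codes of the simulated data are used to draw a reliability histogram that shows the similarity of the groups, further validating the correctness and stability of the hierarchical clustering results. If the trend shown by the reliability histogram agrees with the original clustering result, the clustering can be judged correct. The difference between this method and cluster validation indexes is that the clustering tree can be cut at different heights according to the data geometry shown in the heatmap in order to find highly related groups; on this basis the study proposes a reliability indicator for checking hierarchical clustering results.
Abstract (English)	Nowadays, the most popular way to validate clustering results is to compute various cluster validation indexes. However, these indexes may not give reasonable answers for data with many categorical variables. This study aims to provide a stable method for assessing the homogeneity and stability of hierarchical clustering (HC). Multiple HC trees are built from simulated data, and the path to each category in each tree is recorded. A histogram based on the coded paths from the simulated data is then built to validate the reliability and stability of the HC results. Unlike common cluster validation indexes, the proposed method relies on the clustering structure shown in the heatmap and cuts the dendrogram at different heights to find reasonable and highly relevant groups, which increases the flexibility of the clustering.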
Description	Master's (碩士); National Chengchi University (國立政治大學); Department of Statistics (統計學系); 108354027
Source	http://thesis.lib.nccu.edu.tw/record/#G0108354027
Type	thesis
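
The abstract describes validating a hierarchical clustering by re-clustering simulated copies of the data and checking whether the same groups reappear. The sketch below illustrates that idea in outline only, and several parts are assumptions rather than the thesis's procedure: each category is a row of proportions, rows are redrawn by multinomial sampling (a stand-in for the ANOVA-style simulation), the tree uses average linkage on Euclidean distance, and stability is summarized as a pairwise co-clustering rate instead of the coded-path histogram. The names `cut_tree`, `simulate`, `co_clustering_rate` and the toy `table` are hypothetical.

```python
import random
from itertools import combinations


def euclid(a, b):
    """Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cut_tree(rows, k):
    """Average-linkage agglomerative clustering, stopped at k clusters.

    Assumption: average linkage; the record does not state the linkage used.
    Returns a list of frozensets of row indices.
    """
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Mean pairwise distance between the members of clusters a and b.
            total = sum(euclid(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return [frozenset(c) for c in clusters]


def simulate(props, n, rng):
    """Redraw one row of proportions from n multinomial trials.

    Assumption: a simple stand-in for the ANOVA-based simulation.
    """
    draws = rng.choices(range(len(props)), weights=props, k=n)
    return [draws.count(j) / n for j in range(len(props))]


def co_clustering_rate(table, k, n=200, reps=100, seed=0):
    """Share of simulated replicates in which each pair of rows shares a cluster."""
    rng = random.Random(seed)
    together = {pair: 0 for pair in combinations(range(len(table)), 2)}
    for _ in range(reps):
        sim = [simulate(row, n, rng) for row in table]
        for cluster in cut_tree(sim, k):
            for pair in combinations(sorted(cluster), 2):
                together[pair] += 1
    return {pair: count / reps for pair, count in together.items()}


# Toy data (hypothetical): four categories, proportions over three levels.
table = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.20, 0.70],
    [0.12, 0.18, 0.70],
]
original = cut_tree(table, 2)         # clustering of the observed data
rates = co_clustering_rate(table, 2)  # how often each pair co-clusters
for pair in sorted(rates):
    print(pair, rates[pair])
```

Pairs that share a cluster in `original` and also have a rate close to 1 are stable under resampling; a pair whose rate falls well below 1 suggests the original split at that height is fragile. Changing `k` plays the role of cutting the tree at a different height.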

dc.contributor.advisor	周珮婷	zh_TW
dc.contributor.advisor	Chou, Pei-Ting	en_US
dc.contributor.author (Authors)	林韋志	zh_TW
dc.contributor.author (Authors)	Lin, Wei-Chih	en_US
dc.creator (Author)	林韋志	zh_TW
dc.creator (Author)	Lin, Wei-Chih	en_US
dc.date (Date)	2021	en_US
dc.date.accessioned	1-Jul-2021 17:34:21 (UTC+8)	-
dc.date.available	1-Jul-2021 17:34:21 (UTC+8)	-
dc.date.issued (Upload time)	1-Jul-2021 17:34:21 (UTC+8)	-
dc.identifier (Other Identifiers)	G0108354027	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/135931	-
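dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Statistics (統計學系)	zh_TW
dc.description (Description)	108354027	zh_TW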
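dc.description.tableofcontents	Table of contents	zh_TW
	Chapter 1 Introduction
	Chapter 2 Literature Review
	Chapter 3 Methodology
		Section 1 Data structure of the Extreme-T categorical data problem
		Section 2 Information content of the data and clustering distance calculation
		Section 3 Advanced clustering distance calculation and HC trees
		Section 4 Evaluation of clustering results
	Chapter 4 Research Process and Results
		1. Tourism Bureau (Ministry of Transportation and Communications) tourism market survey: dataset of inbound visitors to Taiwan by country and purpose of visit
		2. NBA 2019-2020 players shooting dataset
		3. NCCU international students by department dataset
		4. Kaggle e-commerce women's clothing statistics dataset
	Chapter 5 Conclusions and Recommendations
	Chapter 6 References
	List of tables
		Table 3-1 Example data structure
		Table 4-1 Cluster validation indexes for the hierarchical clustering of countries visiting Taiwan
		Table 4-2 Cluster validation indexes for the hierarchical clustering of NBA players
		Table 4-3 Cluster validation indexes for the hierarchical clustering of NCCU departments
		Table 4-4 Cluster validation indexes for the hierarchical clustering of e-commerce departments
	List of figures
		Figure 3-1 Example of country coding
		Figure 3-2 Schematic of the clustering validation method
		Figure 4-1 Hierarchical clustering tree of countries visiting Taiwan
		Figure 4-2 Euclidean-distance clustering tree of purposes of visit to Taiwan
		Figure 4-3 Heatmap of purpose-of-visit proportions by country
		Figures 4-4 to 4-8 Simulated clustering results, groups A to E
		Figure 4-9 Hierarchical clustering tree of NBA players, 2019-20 season (left half)
		Figure 4-10 Hierarchical clustering tree of NBA players, 2019-20 season (right half)
		Figure 4-11 Euclidean-distance clustering tree of NBA players' shot types, 2019-20 season
		Figure 4-12 Heatmap of NBA players' shot proportions
		Figures 4-13 to 4-20 Simulated clustering results, groups A to H
		Figure 4-21 Hierarchical clustering tree of NCCU departments
		Figure 4-22 Euclidean-distance clustering tree of NCCU international students' nationalities
		Figure 4-23 Heatmap of international student proportions by NCCU department
		Figures 4-24 to 4-27 Simulated clustering results, groups A to D
		Figure 4-28 Hierarchical clustering tree of e-commerce departments
		Figure 4-29 Euclidean-distance clustering tree of customer ratings
		Figure 4-30 Heatmap of rating proportions for the e-commerce women's clothing departments
		Figures 4-31 to 4-32 Simulated clustering results, groups A and B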
dc.format.extent	1947826 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0108354027	en_US
dc.subject (Keywords)	非監督機器學習	zh_TW
dc.subject (Keywords)	階層式分群	zh_TW
dc.subject (Keywords)	分群驗證	zh_TW
dc.subject (Keywords)	Unsupervised Machine Learning	en_US
dc.subject (Keywords)	Hierarchical Clustering	en_US
dc.subject (Keywords)	Cluster Validation	en_US
dc.title (Title)	階層式分群方法的同質性與穩固性	zh_TW
dc.title (Title)	Homogeneity and Stability of Hierarchical Clustering	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202100611	en_US
