學術產出-學位論文
文章檢視/開啟
書目匯出
-
題名 基於大數據資料的非監督分散式分群演算法
An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data作者 邱垂暉
Chiu, Chui Hui貢獻者 郁方
Yu, Fang
邱垂暉
Chiu, Chui Hui關鍵詞 非監督式分群
GHSOM
Actor Model
惡意程式偵測
平行運算
Unsupervised clustering
GHSOM
Actor model
Malware detection
Parallel computation日期 2017 上傳時間 10-八月-2017 11:13:04 (UTC+8) 摘要 基於屬性相似度將樣本進行分群的技術已經被廣泛應用在許多領域,如模式識別,特徵提取和惡意行為偵測。由於此技術的重要性,很多人已經將各種分群技術利用分散式框架進行再製,例如K-means搭配Hadoop在Apache Mahout平台上。由於K-means需要預先定義分群數量,而自組織映射圖(SOM)需要預先定義圖的大小,所以能夠自動將樣本依照樣本間的變化容差進行分群的GHSOM(增長層次自組織映射圖)就提供了一個很棒的非監督學習方法用來針對某些資訊不完整的資料。然而,GHSOM目前並不是一個分散式的演算法,這就限制了其在大數據資料的應用上。在本篇論文中,我們提出了一種新的分散式GHSOM演算法。我們使用Scala的Actor Model來實現GHSOM的分散式系統,我們將GHSOM演算法中的水平擴增以及垂直擴增交由Actor來處理並顯示出顯著的性能提升。為了評估我們所提出的方法,我們收集並分析了數千個惡意程式在現實生活中的執行行為,並通過在數百萬個樣本上進行非監督分群後推導出惡意程式行為的檢測規則來顯示其性能的改進、規則有效性以及實踐中的潛在用法。
Clustering techniques that group samples based on their attribute similarity have been widely used in many fields such as pattern recognition, feature extraction and malicious behavior characterization. Due to its importance, various clustering techniques have been developed with distributed frameworks such as K-means with Hadoop in Apache Mahout for scalable computation. While K-means requires the number of clusters and self organizing maps (SOM) requires the map size to be given, the technique of GHSOM (growing hierarchical self organizing maps) that clusters samples dynamically to satisfy the requirement on tolerance of variation between samples, poses an attractive unsupervised learning solution for data that have limited information to decide the number of clusters in advance. However it is not scalable with sequential computation, which limits its applications on big data. In this paper, we present a novel distributed algorithm on GHSOM. We take advantage of parallel computation with scala actor model for GHSOM construction, distributing vertical and horizontal expansion tasks to actors and showing significant performance improvement. To evaluate the presented approach, we collect and analyze execution behaviors of thousands of malware in real life and derive detection rules with the presented unsupervised clustering on millions samples, showing its performance improvement, rule effectiveness and potential usage in practice.參考文獻 [1] "Kvm," http://www.linux-kvm.org/page/Main Page/, (Visited on 7/15/2016).[2] S.-W. Lee and F. Yu, "Securing kvm-based cloud systems via virtualization intro-spection," in System Sciences (HICSS), 2014 47th Hawaii International Conferenceon. IEEE, 2014, pp. 5028-5037.[3] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6,1998.[4] J. Vesanto, "Som-based data visualization methods," Intelligent data analysis, vol. 3,no. 2, pp. 111-126, 1999.[5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lan-der, and T. R. Golub, "Interpreting patterns of gene expression with self-organizingmaps: methods and application to hematopoietic differentiation," Proceedings of theNational Academy of Sciences, vol. 96, no. 6, pp. 2907-2912, 1999.[6] E. Alhoniemi, J. Hollmen, O. Simula, and J. Vesanto, "Process monitoring and mod-eling using the self-organizing map," Integrated Computer-Aided Engineering, vol. 6,no. 1, pp. 3-14, 1999.[7] A. M. Kalteh, P. Hjorth, and R. Berndtsson, "Review of the self-organizing map(som) approach in water resources: Analysis, modelling and application," Environ-mental Modelling & Software, vol. 23, no. 7, pp. 835-845, 2008.[8] E. J. Palomo, J. North, D. Elizondo, R. M. Luque, and T. Watson, "Application ofgrowing hierarchical som for visualisation of network forensics traffic data," NeuralNetworks, vol. 32, pp. 275-284, 2012.[9] S.-Y. Huang and Y.-N. Huang, "Network traffic anomaly detection based on growinghierarchical som," in Dependable Systems and Networks (DSN), 2013 43rd AnnualIEEE/IFIP International Conference on. IEEE, 2013, pp. 1-2.[10] Y.-H. Li, Y.-R. Tzeng, and F. Yu, "Viso: Characterizing malicious behaviors ofvirtual machines with unsupervised clustering," in Cloud Computing Technology andScience (CloudCom), 2015 IEEE 7th International Conference on. IEEE, 2015, pp.34-41.[11] R. M. Esteves, R. Pais, and C. Rong, "K-means clustering in the cloud-a mahouttest," in Advanced Information Networking and Applications (WAINA), 2011 IEEEWorkshops of International Conference on. IEEE, 2011, pp. 514-519.[12] "Apache mahout," http://mahout.apache.org/.[13] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, andA. Y. Wu, "An effcient k-means clustering algorithm: Analysis and implementa-tion," IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7,pp. 881-892, 2002.[14] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton, "Big data,"The management revolution. Harvard Bus Rev, vol. 90, no. 10, pp. 61-67, 2012.[15] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, andA. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE transactions on emerging topics in computing, vol. 2, no. 3, pp.267-279, 2014.[16] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal processingmagazine, vol. 13, no. 6, pp. 47-60, 1996.[17] "Bloom filter," http://en.wikipedia.org/wiki/Bloom_filter/, (Visited on 10/15/2016).[18] K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion de-tection using clusters," in Proceedings of the Twenty-eighth Australasian conferenceon Computer Science-Volume 38. Australian Computer Society, Inc., 2005, pp.333-342.[19] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based mal-ware detection system for android," in Proceedings of the 1st ACM workshop onSecurity and privacy in smartphones and mobile devices. ACM, 2011, pp. 15-26.[20] C. Hewitt, "Actor model of computation: scalable robust information systems,"arXiv preprint arXiv:1008.1459, 2010.[21] "Akka," http://akka.io/, (Visited on 10/15/2016).[22] "Cuckoo sandbox," http://cuckoosandbox.org/, (Visited on 7/15/2016).[23] "Malware knowledge base," http://owl.nchc.org.tw/, (Visited on 6/20/2016).[24] S.-W. Hsiao, Y.-N. Chen, Y. S. Sun, and M. C. Chen, "Combining dynamic pas-sive analysis and active fingerprinting for effective bot malware detection in virtu-alized environments," in International Conference on Network and System Security.Springer, 2013, pp. 699-706.[25] "Virustotal," https://www.virustotal.com, (Visited on 4/15/2017).[26] M. Dittenbach, D. Merkl, and A. Rauber, "The growing hierarchical self-organizingmap," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNSInternational Joint Conference on, vol. 6. IEEE, 2000, pp. 15-19.[27] J. A. Hartigan and M. A.Wong, "Algorithm as 136: A k-means clustering algorithm,"Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1,pp. 100-108, 1979.[28] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: A survey,"Internet mathematics, vol. 1, no. 4, pp. 485-509, 2004. 描述 碩士
國立政治大學
資訊管理學系
104356019資料來源 http://thesis.lib.nccu.edu.tw/record/#G0104356019 資料類型 thesis dc.contributor.advisor 郁方 zh_TW dc.contributor.advisor Yu, Fang en_US dc.contributor.author (作者) 邱垂暉 zh_TW dc.contributor.author (作者) Chiu, Chui Hui en_US dc.creator (作者) 邱垂暉 zh_TW dc.creator (作者) Chiu, Chui Hui en_US dc.date (日期) 2017 en_US dc.date.accessioned 10-八月-2017 11:13:04 (UTC+8) - dc.date.available 10-八月-2017 11:13:04 (UTC+8) - dc.date.issued (上傳時間) 10-八月-2017 11:13:04 (UTC+8) - dc.identifier (其他 識別碼) G0104356019 en_US dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/111897 - dc.description (描述) 碩士 zh_TW dc.description (描述) 國立政治大學 zh_TW dc.description (描述) 資訊管理學系 zh_TW dc.description (描述) 104356019 zh_TW dc.description.abstract (摘要) 基於屬性相似度將樣本進行分群的技術已經被廣泛應用在許多領域,如模式識別,特徵提取和惡意行為偵測。由於此技術的重要性,很多人已經將各種分群技術利用分散式框架進行再製,例如K-means搭配Hadoop在Apache Mahout平台上。由於K-means需要預先定義分群數量,而自組織映射圖(SOM)需要預先定義圖的大小,所以能夠自動將樣本依照樣本間的變化容差進行分群的GHSOM(增長層次自組織映射圖)就提供了一個很棒的非監督學習方法用來針對某些資訊不完整的資料。然而,GHSOM目前並不是一個分散式的演算法,這就限制了其在大數據資料的應用上。在本篇論文中,我們提出了一種新的分散式GHSOM演算法。我們使用Scala的Actor Model來實現GHSOM的分散式系統,我們將GHSOM演算法中的水平擴增以及垂直擴增交由Actor來處理並顯示出顯著的性能提升。為了評估我們所提出的方法,我們收集並分析了數千個惡意程式在現實生活中的執行行為,並通過在數百萬個樣本上進行非監督分群後推導出惡意程式行為的檢測規則來顯示其性能的改進、規則有效性以及實踐中的潛在用法。 zh_TW dc.description.abstract (摘要) Clustering techniques that group samples based on their attribute similarity have been widely used in many fields such as pattern recognition, feature extraction and malicious behavior characterization. Due to its importance, various clustering techniques have been developed with distributed frameworks such as K-means with Hadoop in Apache Mahout for scalable computation. While K-means requires the number of clusters and self organizing maps (SOM) requires the map size to be given, the technique of GHSOM (growing hierarchical self organizing maps) that clusters samples dynamically to satisfy the requirement on tolerance of variation between samples, poses an attractive unsupervised learning solution for data that have limited information to decide the number of clusters in advance. However it is not scalable with sequential computation, which limits its applications on big data. In this paper, we present a novel distributed algorithm on GHSOM. We take advantage of parallel computation with scala actor model for GHSOM construction, distributing vertical and horizontal expansion tasks to actors and showing significant performance improvement. To evaluate the presented approach, we collect and analyze execution behaviors of thousands of malware in real life and derive detection rules with the presented unsupervised clustering on millions samples, showing its performance improvement, rule effectiveness and potential usage in practice. en_US dc.description.tableofcontents 1 Introduction 12 Related Work 22.1 Comparison of Clustering Algorithm 22.1.1 Self Organizing Map (SOM) 22.1.2 Growing Hierarchical Self-Organizing Map (GHSOM) 42.1.3 K-means 53 Actor-based GHSOM Algorithms 73.1 Distributed GHSOM overview 73.2 Actor Model 83.3 The Distributed GHSOM Algorithm 83.4 The GHSOM Tree Builder 103.5 Expander 124 Performance Evaluation 155 Experiment 155.1 System Architecture 156 Malware Detection with Clusters 177 Experiments 197.1 Data Sets of Malware 197.2 Dataset from OWL 207.2.1 Detection rule for Worm and Trojan type malwares 217.2.2 Moving window approach 217.3 High-dimensional data clustering 227.3.1 Detection rule for Win32 type malwares 237.4 A Data Clustering Example 247.5 Rule Accuracy Evaluation 247.6 Detection Rules for OWL data set 277.6.1 Rule 1 277.6.2 Rule 2 277.6.3 Rule 3 287.6.4 Rule 4 287.6.5 Rule 5 297.6.6 Rule 6 307.6.7 Rule 7 317.6.8 Rule 8 317.6.9 Rule 9 327.6.10 Rule 10 337.6.11 Rule 11 347.6.12 Rule 12 347.6.13 Rule 13 357.6.14 Rule 14 357.6.15 Rule 15 367.6.16 Rule 16 377.6.17 Rule 17 387.6.18 Rule 18 387.6.19 Rule 19 387.7 Detection Rules for OWL data set with moving window value 100 407.7.1 Rule 1 407.7.2 Rule 2 407.7.3 Rule 3 417.7.4 Rule 4 427.7.5 Rule 5 427.7.6 Rule 6 447.7.7 Rule 7 457.7.8 Rule 8 457.7.9 Rule 9 467.7.10 Rule 10 477.7.11 Rule 11 487.7.12 Rule 12 487.7.13 Rule 13 507.7.14 Rule 14 507.7.15 Rule 15 517.7.16 Rule 16 527.8 Detection Rules for OWL data set with moving window value 500 537.8.1 Rule 1 537.8.2 Rule 2 547.8.3 Rule 3 547.8.4 Rule 4 557.8.5 Rule 5 567.8.6 Rule 6 567.8.7 Rule 7 577.8.8 Rule 8 577.8.9 Rule 9 587.8.10 Rule 10 597.8.11 Rule 11 607.9 Rule Comparison with Moving Window Approach 607.9.1 Malware Rule Comparison 617.9.2 An Example of Detection Rules of Malware Type: Win32/Allaple.155648 627.10 Accuracy Comparison with K-means 638 Conclusion 65References 65 zh_TW dc.format.extent 1589924 bytes - dc.format.mimetype application/pdf - dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0104356019 en_US dc.subject (關鍵詞) 非監督式分群 zh_TW dc.subject (關鍵詞) GHSOM zh_TW dc.subject (關鍵詞) Actor Model zh_TW dc.subject (關鍵詞) 惡意程式偵測 zh_TW dc.subject (關鍵詞) 平行運算 zh_TW dc.subject (關鍵詞) Unsupervised clustering en_US dc.subject (關鍵詞) GHSOM en_US dc.subject (關鍵詞) Actor model en_US dc.subject (關鍵詞) Malware detection en_US dc.subject (關鍵詞) Parallel computation en_US dc.title (題名) 基於大數據資料的非監督分散式分群演算法 zh_TW dc.title (題名) An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data en_US dc.type (資料類型) thesis en_US dc.relation.reference (參考文獻) [1] "Kvm," http://www.linux-kvm.org/page/Main Page/, (Visited on 7/15/2016).[2] S.-W. Lee and F. Yu, "Securing kvm-based cloud systems via virtualization intro-spection," in System Sciences (HICSS), 2014 47th Hawaii International Conferenceon. IEEE, 2014, pp. 5028-5037.[3] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6,1998.[4] J. Vesanto, "Som-based data visualization methods," Intelligent data analysis, vol. 3,no. 2, pp. 111-126, 1999.[5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lan-der, and T. R. Golub, "Interpreting patterns of gene expression with self-organizingmaps: methods and application to hematopoietic differentiation," Proceedings of theNational Academy of Sciences, vol. 96, no. 6, pp. 2907-2912, 1999.[6] E. Alhoniemi, J. Hollmen, O. Simula, and J. Vesanto, "Process monitoring and mod-eling using the self-organizing map," Integrated Computer-Aided Engineering, vol. 6,no. 1, pp. 3-14, 1999.[7] A. M. Kalteh, P. Hjorth, and R. Berndtsson, "Review of the self-organizing map(som) approach in water resources: Analysis, modelling and application," Environ-mental Modelling & Software, vol. 23, no. 7, pp. 835-845, 2008.[8] E. J. Palomo, J. North, D. Elizondo, R. M. Luque, and T. Watson, "Application ofgrowing hierarchical som for visualisation of network forensics traffic data," NeuralNetworks, vol. 32, pp. 275-284, 2012.[9] S.-Y. Huang and Y.-N. Huang, "Network traffic anomaly detection based on growinghierarchical som," in Dependable Systems and Networks (DSN), 2013 43rd AnnualIEEE/IFIP International Conference on. IEEE, 2013, pp. 1-2.[10] Y.-H. Li, Y.-R. Tzeng, and F. Yu, "Viso: Characterizing malicious behaviors ofvirtual machines with unsupervised clustering," in Cloud Computing Technology andScience (CloudCom), 2015 IEEE 7th International Conference on. IEEE, 2015, pp.34-41.[11] R. M. Esteves, R. Pais, and C. Rong, "K-means clustering in the cloud-a mahouttest," in Advanced Information Networking and Applications (WAINA), 2011 IEEEWorkshops of International Conference on. IEEE, 2011, pp. 514-519.[12] "Apache mahout," http://mahout.apache.org/.[13] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, andA. Y. Wu, "An effcient k-means clustering algorithm: Analysis and implementa-tion," IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7,pp. 881-892, 2002.[14] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton, "Big data,"The management revolution. Harvard Bus Rev, vol. 90, no. 10, pp. 61-67, 2012.[15] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, andA. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE transactions on emerging topics in computing, vol. 2, no. 3, pp.267-279, 2014.[16] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal processingmagazine, vol. 13, no. 6, pp. 47-60, 1996.[17] "Bloom filter," http://en.wikipedia.org/wiki/Bloom_filter/, (Visited on 10/15/2016).[18] K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion de-tection using clusters," in Proceedings of the Twenty-eighth Australasian conferenceon Computer Science-Volume 38. Australian Computer Society, Inc., 2005, pp.333-342.[19] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based mal-ware detection system for android," in Proceedings of the 1st ACM workshop onSecurity and privacy in smartphones and mobile devices. ACM, 2011, pp. 15-26.[20] C. Hewitt, "Actor model of computation: scalable robust information systems,"arXiv preprint arXiv:1008.1459, 2010.[21] "Akka," http://akka.io/, (Visited on 10/15/2016).[22] "Cuckoo sandbox," http://cuckoosandbox.org/, (Visited on 7/15/2016).[23] "Malware knowledge base," http://owl.nchc.org.tw/, (Visited on 6/20/2016).[24] S.-W. Hsiao, Y.-N. Chen, Y. S. Sun, and M. C. Chen, "Combining dynamic pas-sive analysis and active fingerprinting for effective bot malware detection in virtu-alized environments," in International Conference on Network and System Security.Springer, 2013, pp. 699-706.[25] "Virustotal," https://www.virustotal.com, (Visited on 4/15/2017).[26] M. Dittenbach, D. Merkl, and A. Rauber, "The growing hierarchical self-organizingmap," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNSInternational Joint Conference on, vol. 6. IEEE, 2000, pp. 15-19.[27] J. A. Hartigan and M. A.Wong, "Algorithm as 136: A k-means clustering algorithm,"Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1,pp. 100-108, 1979.[28] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: A survey,"Internet mathematics, vol. 1, no. 4, pp. 485-509, 2004. zh_TW