基於大數據資料的非監督分散式分群演算法

Publications-Theses

Article View/Open

pdf(72)

Publication Export

Google Scholar^TM

題名	基於大數據資料的非監督分散式分群演算法 An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data
作者	邱垂暉 Chiu, Chui Hui
貢獻者	郁方 Yu, Fang 邱垂暉 Chiu, Chui Hui
關鍵詞	非監督式分群 GHSOM Actor Model 惡意程式偵測平行運算 Unsupervised clustering GHSOM Actor model Malware detection Parallel computation
日期	2017
上傳時間	10-Aug-2017 11:13:04 (UTC+8)
摘要	基於屬性相似度將樣本進行分群的技術已經被廣泛應用在許多領域，如模式識別，特徵提取和惡意行為偵測。由於此技術的重要性，很多人已經將各種分群技術利用分散式框架進行再製，例如K-means搭配Hadoop在Apache Mahout平台上。由於K-means需要預先定義分群數量，而自組織映射圖（SOM）需要預先定義圖的大小，所以能夠自動將樣本依照樣本間的變化容差進行分群的GHSOM（增長層次自組織映射圖）就提供了一個很棒的非監督學習方法用來針對某些資訊不完整的資料。然而，GHSOM目前並不是一個分散式的演算法，這就限制了其在大數據資料的應用上。在本篇論文中，我們提出了一種新的分散式GHSOM演算法。我們使用Scala的Actor Model來實現GHSOM的分散式系統，我們將GHSOM演算法中的水平擴增以及垂直擴增交由Actor來處理並顯示出顯著的性能提升。為了評估我們所提出的方法，我們收集並分析了數千個惡意程式在現實生活中的執行行為，並通過在數百萬個樣本上進行非監督分群後推導出惡意程式行為的檢測規則來顯示其性能的改進、規則有效性以及實踐中的潛在用法。 Clustering techniques that group samples based on their attribute similarity have been widely used in many fields such as pattern recognition, feature extraction and malicious behavior characterization. Due to its importance, various clustering techniques have been developed with distributed frameworks such as K-means with Hadoop in Apache Mahout for scalable computation. While K-means requires the number of clusters and self organizing maps (SOM) requires the map size to be given, the technique of GHSOM (growing hierarchical self organizing maps) that clusters samples dynamically to satisfy the requirement on tolerance of variation between samples, poses an attractive unsupervised learning solution for data that have limited information to decide the number of clusters in advance. However it is not scalable with sequential computation, which limits its applications on big data. In this paper, we present a novel distributed algorithm on GHSOM. We take advantage of parallel computation with scala actor model for GHSOM construction, distributing vertical and horizontal expansion tasks to actors and showing significant performance improvement. To evaluate the presented approach, we collect and analyze execution behaviors of thousands of malware in real life and derive detection rules with the presented unsupervised clustering on millions samples, showing its performance improvement, rule effectiveness and potential usage in practice.
參考文獻	[1] "Kvm," http://www.linux-kvm.org/page/Main Page/, (Visited on 7/15/2016). [2] S.-W. Lee and F. Yu, "Securing kvm-based cloud systems via virtualization intro- spection," in System Sciences (HICSS), 2014 47th Hawaii International Conference on. IEEE, 2014, pp. 5028-5037. [3] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6, 1998. [4] J. Vesanto, "Som-based data visualization methods," Intelligent data analysis, vol. 3, no. 2, pp. 111-126, 1999. [5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lan- der, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proceedings of the National Academy of Sciences, vol. 96, no. 6, pp. 2907-2912, 1999. [6] E. Alhoniemi, J. Hollmen, O. Simula, and J. Vesanto, "Process monitoring and mod- eling using the self-organizing map," Integrated Computer-Aided Engineering, vol. 6, no. 1, pp. 3-14, 1999. [7] A. M. Kalteh, P. Hjorth, and R. Berndtsson, "Review of the self-organizing map (som) approach in water resources: Analysis, modelling and application," Environ- mental Modelling & Software, vol. 23, no. 7, pp. 835-845, 2008. [8] E. J. Palomo, J. North, D. Elizondo, R. M. Luque, and T. Watson, "Application of growing hierarchical som for visualisation of network forensics traffic data," Neural Networks, vol. 32, pp. 275-284, 2012. [9] S.-Y. Huang and Y.-N. Huang, "Network traffic anomaly detection based on growing hierarchical som," in Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on. IEEE, 2013, pp. 1-2. [10] Y.-H. Li, Y.-R. Tzeng, and F. Yu, "Viso: Characterizing malicious behaviors of virtual machines with unsupervised clustering," in Cloud Computing Technology and Science (CloudCom), 2015 IEEE 7th International Conference on. IEEE, 2015, pp. 34-41. [11] R. M. Esteves, R. Pais, and C. Rong, "K-means clustering in the cloud-a mahout test," in Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on. IEEE, 2011, pp. 514-519. [12] "Apache mahout," http://mahout.apache.org/. [13] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An effcient k-means clustering algorithm: Analysis and implementa- tion," IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 881-892, 2002. [14] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton, "Big data," The management revolution. Harvard Bus Rev, vol. 90, no. 10, pp. 61-67, 2012. [15] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE transactions on emerging topics in computing, vol. 2, no. 3, pp. 267-279, 2014. [16] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal processing magazine, vol. 13, no. 6, pp. 47-60, 1996. [17] "Bloom filter," http://en.wikipedia.org/wiki/Bloom_filter/, (Visited on 10/15/2016). [18] K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion de- tection using clusters," in Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38. Australian Computer Society, Inc., 2005, pp. 333-342. [19] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based mal- ware detection system for android," in Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices. ACM, 2011, pp. 15-26. [20] C. Hewitt, "Actor model of computation: scalable robust information systems," arXiv preprint arXiv:1008.1459, 2010. [21] "Akka," http://akka.io/, (Visited on 10/15/2016). [22] "Cuckoo sandbox," http://cuckoosandbox.org/, (Visited on 7/15/2016). [23] "Malware knowledge base," http://owl.nchc.org.tw/, (Visited on 6/20/2016). [24] S.-W. Hsiao, Y.-N. Chen, Y. S. Sun, and M. C. Chen, "Combining dynamic pas- sive analysis and active fingerprinting for effective bot malware detection in virtu- alized environments," in International Conference on Network and System Security. Springer, 2013, pp. 699-706. [25] "Virustotal," https://www.virustotal.com, (Visited on 4/15/2017). [26] M. Dittenbach, D. Merkl, and A. Rauber, "The growing hierarchical self-organizing map," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 6. IEEE, 2000, pp. 15-19. [27] J. A. Hartigan and M. A.Wong, "Algorithm as 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979. [28] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: A survey," Internet mathematics, vol. 1, no. 4, pp. 485-509, 2004.
描述	碩士國立政治大學資訊管理學系 104356019
資料來源	http://thesis.lib.nccu.edu.tw/record/#G0104356019
資料類型	thesis

dc.contributor.advisor	郁方	zh_TW
dc.contributor.advisor	Yu, Fang	en_US
dc.contributor.author (Authors)	邱垂暉	zh_TW
dc.contributor.author (Authors)	Chiu, Chui Hui	en_US
dc.creator (作者)	邱垂暉	zh_TW
dc.creator (作者)	Chiu, Chui Hui	en_US
dc.date (日期)	2017	en_US
dc.date.accessioned	10-Aug-2017 11:13:04 (UTC+8)	-
dc.date.available	10-Aug-2017 11:13:04 (UTC+8)	-
dc.date.issued (上傳時間)	10-Aug-2017 11:13:04 (UTC+8)	-
dc.identifier (Other Identifiers)	G0104356019	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/111897	-
dc.description (描述)	碩士	zh_TW
dc.description (描述)	國立政治大學	zh_TW
dc.description (描述)	資訊管理學系	zh_TW
dc.description (描述)	104356019	zh_TW
dc.description.abstract (摘要)	基於屬性相似度將樣本進行分群的技術已經被廣泛應用在許多領域，如模式識別，特徵提取和惡意行為偵測。由於此技術的重要性，很多人已經將各種分群技術利用分散式框架進行再製，例如K-means搭配Hadoop在Apache Mahout平台上。由於K-means需要預先定義分群數量，而自組織映射圖（SOM）需要預先定義圖的大小，所以能夠自動將樣本依照樣本間的變化容差進行分群的GHSOM（增長層次自組織映射圖）就提供了一個很棒的非監督學習方法用來針對某些資訊不完整的資料。然而，GHSOM目前並不是一個分散式的演算法，這就限制了其在大數據資料的應用上。在本篇論文中，我們提出了一種新的分散式GHSOM演算法。我們使用Scala的Actor Model來實現GHSOM的分散式系統，我們將GHSOM演算法中的水平擴增以及垂直擴增交由Actor來處理並顯示出顯著的性能提升。為了評估我們所提出的方法，我們收集並分析了數千個惡意程式在現實生活中的執行行為，並通過在數百萬個樣本上進行非監督分群後推導出惡意程式行為的檢測規則來顯示其性能的改進、規則有效性以及實踐中的潛在用法。	zh_TW
dc.description.abstract (摘要)	Clustering techniques that group samples based on their attribute similarity have been widely used in many fields such as pattern recognition, feature extraction and malicious behavior characterization. Due to its importance, various clustering techniques have been developed with distributed frameworks such as K-means with Hadoop in Apache Mahout for scalable computation. While K-means requires the number of clusters and self organizing maps (SOM) requires the map size to be given, the technique of GHSOM (growing hierarchical self organizing maps) that clusters samples dynamically to satisfy the requirement on tolerance of variation between samples, poses an attractive unsupervised learning solution for data that have limited information to decide the number of clusters in advance. However it is not scalable with sequential computation, which limits its applications on big data. In this paper, we present a novel distributed algorithm on GHSOM. We take advantage of parallel computation with scala actor model for GHSOM construction, distributing vertical and horizontal expansion tasks to actors and showing significant performance improvement. To evaluate the presented approach, we collect and analyze execution behaviors of thousands of malware in real life and derive detection rules with the presented unsupervised clustering on millions samples, showing its performance improvement, rule effectiveness and potential usage in practice.	en_US
dc.description.tableofcontents	1 Introduction 1 2 Related Work 2 2.1 Comparison of Clustering Algorithm 2 2.1.1 Self Organizing Map (SOM) 2 2.1.2 Growing Hierarchical Self-Organizing Map (GHSOM) 4 2.1.3 K-means 5 3 Actor-based GHSOM Algorithms 7 3.1 Distributed GHSOM overview 7 3.2 Actor Model 8 3.3 The Distributed GHSOM Algorithm 8 3.4 The GHSOM Tree Builder 10 3.5 Expander 12 4 Performance Evaluation 15 5 Experiment 15 5.1 System Architecture 15 6 Malware Detection with Clusters 17 7 Experiments 19 7.1 Data Sets of Malware 19 7.2 Dataset from OWL 20 7.2.1 Detection rule for Worm and Trojan type malwares 21 7.2.2 Moving window approach 21 7.3 High-dimensional data clustering 22 7.3.1 Detection rule for Win32 type malwares 23 7.4 A Data Clustering Example 24 7.5 Rule Accuracy Evaluation 24 7.6 Detection Rules for OWL data set 27 7.6.1 Rule 1 27 7.6.2 Rule 2 27 7.6.3 Rule 3 28 7.6.4 Rule 4 28 7.6.5 Rule 5 29 7.6.6 Rule 6 30 7.6.7 Rule 7 31 7.6.8 Rule 8 31 7.6.9 Rule 9 32 7.6.10 Rule 10 33 7.6.11 Rule 11 34 7.6.12 Rule 12 34 7.6.13 Rule 13 35 7.6.14 Rule 14 35 7.6.15 Rule 15 36 7.6.16 Rule 16 37 7.6.17 Rule 17 38 7.6.18 Rule 18 38 7.6.19 Rule 19 38 7.7 Detection Rules for OWL data set with moving window value 100 40 7.7.1 Rule 1 40 7.7.2 Rule 2 40 7.7.3 Rule 3 41 7.7.4 Rule 4 42 7.7.5 Rule 5 42 7.7.6 Rule 6 44 7.7.7 Rule 7 45 7.7.8 Rule 8 45 7.7.9 Rule 9 46 7.7.10 Rule 10 47 7.7.11 Rule 11 48 7.7.12 Rule 12 48 7.7.13 Rule 13 50 7.7.14 Rule 14 50 7.7.15 Rule 15 51 7.7.16 Rule 16 52 7.8 Detection Rules for OWL data set with moving window value 500 53 7.8.1 Rule 1 53 7.8.2 Rule 2 54 7.8.3 Rule 3 54 7.8.4 Rule 4 55 7.8.5 Rule 5 56 7.8.6 Rule 6 56 7.8.7 Rule 7 57 7.8.8 Rule 8 57 7.8.9 Rule 9 58 7.8.10 Rule 10 59 7.8.11 Rule 11 60 7.9 Rule Comparison with Moving Window Approach 60 7.9.1 Malware Rule Comparison 61 7.9.2 An Example of Detection Rules of Malware Type: Win32/Allaple.155648 62 7.10 Accuracy Comparison with K-means 63 8 Conclusion 65 References 65	zh_TW
dc.format.extent	1589924 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (資料來源)	http://thesis.lib.nccu.edu.tw/record/#G0104356019	en_US
dc.subject (關鍵詞)	非監督式分群	zh_TW
dc.subject (關鍵詞)	GHSOM	zh_TW
dc.subject (關鍵詞)	Actor Model	zh_TW
dc.subject (關鍵詞)	惡意程式偵測	zh_TW
dc.subject (關鍵詞)	平行運算	zh_TW
dc.subject (關鍵詞)	Unsupervised clustering	en_US
dc.subject (關鍵詞)	GHSOM	en_US
dc.subject (關鍵詞)	Actor model	en_US
dc.subject (關鍵詞)	Malware detection	en_US
dc.subject (關鍵詞)	Parallel computation	en_US
dc.title (題名)	基於大數據資料的非監督分散式分群演算法	zh_TW
dc.title (題名)	An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data	en_US
dc.type (資料類型)	thesis	en_US
dc.relation.reference (參考文獻)	[1] "Kvm," http://www.linux-kvm.org/page/Main Page/, (Visited on 7/15/2016). [2] S.-W. Lee and F. Yu, "Securing kvm-based cloud systems via virtualization intro- spection," in System Sciences (HICSS), 2014 47th Hawaii International Conference on. IEEE, 2014, pp. 5028-5037. [3] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6, 1998. [4] J. Vesanto, "Som-based data visualization methods," Intelligent data analysis, vol. 3, no. 2, pp. 111-126, 1999. [5] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lan- der, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proceedings of the National Academy of Sciences, vol. 96, no. 6, pp. 2907-2912, 1999. [6] E. Alhoniemi, J. Hollmen, O. Simula, and J. Vesanto, "Process monitoring and mod- eling using the self-organizing map," Integrated Computer-Aided Engineering, vol. 6, no. 1, pp. 3-14, 1999. [7] A. M. Kalteh, P. Hjorth, and R. Berndtsson, "Review of the self-organizing map (som) approach in water resources: Analysis, modelling and application," Environ- mental Modelling & Software, vol. 23, no. 7, pp. 835-845, 2008. [8] E. J. Palomo, J. North, D. Elizondo, R. M. Luque, and T. Watson, "Application of growing hierarchical som for visualisation of network forensics traffic data," Neural Networks, vol. 32, pp. 275-284, 2012. [9] S.-Y. Huang and Y.-N. Huang, "Network traffic anomaly detection based on growing hierarchical som," in Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on. IEEE, 2013, pp. 1-2. [10] Y.-H. Li, Y.-R. Tzeng, and F. Yu, "Viso: Characterizing malicious behaviors of virtual machines with unsupervised clustering," in Cloud Computing Technology and Science (CloudCom), 2015 IEEE 7th International Conference on. IEEE, 2015, pp. 34-41. [11] R. M. Esteves, R. Pais, and C. Rong, "K-means clustering in the cloud-a mahout test," in Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on. IEEE, 2011, pp. 514-519. [12] "Apache mahout," http://mahout.apache.org/. [13] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An effcient k-means clustering algorithm: Analysis and implementa- tion," IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 881-892, 2002. [14] A. McAfee, E. Brynjolfsson, T. H. Davenport, D. Patil, and D. Barton, "Big data," The management revolution. Harvard Bus Rev, vol. 90, no. 10, pp. 61-67, 2012. [15] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE transactions on emerging topics in computing, vol. 2, no. 3, pp. 267-279, 2014. [16] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal processing magazine, vol. 13, no. 6, pp. 47-60, 1996. [17] "Bloom filter," http://en.wikipedia.org/wiki/Bloom_filter/, (Visited on 10/15/2016). [18] K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion de- tection using clusters," in Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38. Australian Computer Society, Inc., 2005, pp. 333-342. [19] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based mal- ware detection system for android," in Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices. ACM, 2011, pp. 15-26. [20] C. Hewitt, "Actor model of computation: scalable robust information systems," arXiv preprint arXiv:1008.1459, 2010. [21] "Akka," http://akka.io/, (Visited on 10/15/2016). [22] "Cuckoo sandbox," http://cuckoosandbox.org/, (Visited on 7/15/2016). [23] "Malware knowledge base," http://owl.nchc.org.tw/, (Visited on 6/20/2016). [24] S.-W. Hsiao, Y.-N. Chen, Y. S. Sun, and M. C. Chen, "Combining dynamic pas- sive analysis and active fingerprinting for effective bot malware detection in virtu- alized environments," in International Conference on Network and System Security. Springer, 2013, pp. 699-706. [25] "Virustotal," https://www.virustotal.com, (Visited on 4/15/2017). [26] M. Dittenbach, D. Merkl, and A. Rauber, "The growing hierarchical self-organizing map," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 6. IEEE, 2000, pp. 15-19. [27] J. A. Hartigan and M. A.Wong, "Algorithm as 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979. [28] A. Broder and M. Mitzenmacher, "Network applications of bloom filters: A survey," Internet mathematics, vol. 1, no. 4, pp. 485-509, 2004.	zh_TW

Publications-Theses

Article View/Open

Publication Export

Google ScholarTM

NCCU Library

Citation Infomation

Related Publications in TAIR

Google Scholar^TM