Title 歸納惡意軟體特徵
Malware Family Characterization
Author 劉其峰
Liu, Chi-Feng
Contributors 郁方
Yu, Fang
劉其峰
Liu, Chi-Feng
Keywords Recurrent neural network (RNN)
Growing hierarchical self-organizing map (GHSOM)
Long short-term memory (LSTM)
Malware
Dynamic analysis
Sequence encoding
Date 2018
Uploaded 3-Sep-2018 15:47:50 (UTC+8)
Abstract Large amounts of sensitive data are now accessible through personal computers and cloud services, which attracts attackers to develop malicious software (malware) to steal that data. Following the success of deep learning in image and language recognition, researchers have begun applying deep learning approaches to malware analysis and identification. This work addresses the problem of analyzing and identifying complex, unstructured malware behaviors by proposing a framework that combines unsupervised and supervised learning algorithms with a novel sequence-aware encoding method. Specifically, we adopt a hybrid Growing Hierarchical Self-Organizing Map (GHSOM) algorithm to cluster similar malware behavior sequences, derived from system call sequences, and encode them as cluster feature vectors. A Recurrent Neural Network (RNN) is then trained on the sequence of behavior vectors to detect malware and to predict the corresponding malware family. Our experiments show that accuracy reaches up to 0.98 for malware detection and 0.719 for malware family classification on an 18-category malware dataset.
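As a rough illustration of the first step in the abstract (turning a system call trace into vectors that a clustering algorithm such as GHSOM could consume), the sketch below splits a trace into fixed-size windows and computes the relative frequency of each call per window. The function name, the toy vocabulary, the window size, and the choice of relative frequencies are assumptions made for illustration; this record does not specify the thesis's actual encoding.

```python
from collections import Counter
from typing import List, Sequence


def window_frequency_vectors(
    calls: Sequence[str], vocabulary: Sequence[str], window: int
) -> List[List[float]]:
    """Split a system call trace into fixed-size windows and return, for
    each window, the relative frequency of every call in `vocabulary`.

    Hypothetical helper: the real encoding used in the thesis may differ.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    vectors = []
    for start in range(0, len(calls), window):
        chunk = calls[start:start + window]  # the last chunk may be shorter
        counts = Counter(chunk)              # missing calls count as 0
        vectors.append([counts[name] / len(chunk) for name in vocabulary])
    return vectors


# Toy trace and vocabulary, invented for the example.
trace = ["open", "read", "read", "write", "close", "open", "read", "close"]
vocab = ["open", "read", "write", "close"]
vectors = window_frequency_vectors(trace, vocab, window=4)
```

Each resulting vector describes one interval of behavior; in a pipeline like the one the abstract outlines, such vectors would be clustered first and the sequence of cluster codes would then be fed to the RNN.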
References [1] BiObserve (https://commons.wikimedia.org/wiki/User:BiObserve; raster version previously uploaded to Wikimedia), A. Graves, A.-r. Mohamed, and G. Hinton (original), and Eddie Antonio Santos (SVG version with TeX math), “Peephole long short-term memory,” CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons.
[2] R. J. Canzanese Jr, “Detection and classification of malicious processes using system call analysis,” Ph.D. dissertation, Drexel University, 2015.
[3] T. Moore, D. J. Pym, C. Ioannidis et al., Economics of information security and privacy. Springer, 2010.
[4] N. Idika and A. P. Mathur, “A survey of malware detection techniques,” Purdue University, vol. 48, 2007.
[5] “Manalyze,” https://github.com/JusticeRage/Manalyze, [Online; accessed 4-May-2018].
[6] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A sense of self for Unix processes,” in Security and Privacy, 1996. Proceedings., 1996 IEEE Symposium on. IEEE, 1996, pp. 120–128.
[7] M. Rhode, P. Burnap, and K. Jones, “Early stage malware prediction using recurrent neural networks,” arXiv preprint arXiv:1708.03513, 2017.
[8] X. Wang and S. M. Yiu, “A multi-task learning model for malware classification with useful file access pattern from API call sequence,” arXiv preprint arXiv:1610.05945, 2016.
[9] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2016, pp. 137–149.
[10] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware detection with deep neural network using process behavior,” in Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 577–582.
[11] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, “Malware classification with recurrent networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 1916–1920.
[12] C.-H. Chiu, J.-J. Chen, and F. Yu, “An effective distributed GHSOM algorithm for unsupervised clustering on big data,” in Big Data (BigData Congress), 2017 IEEE International Congress on. IEEE, 2017, pp. 297–304.
[13] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[14] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014.
[15] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
[16] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” 1999.
[17] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[18] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[19] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
[20] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[21] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[22] A. Rauber, D. Merkl, and M. Dittenbach, “The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data,” IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331–1341, 2002.
[23] H. Shi, T. Hamagami, K. Yoshioka, H. Xu, K. Tobe, and S. Goto, “Structural classification and similarity measurement of malware,” IEEJ Transactions on Electrical and Electronic Engineering, vol. 9, no. 6, pp. 621–632, 2014.
[24] W. Shuwei, W. Baosheng, Y. Tang, and Y. Bo, “Malware clustering based on SNN density using system calls,” in International Conference on Cloud Computing and Security. Springer, 2015, pp. 181–191.
[25] M. Dittenbach, D. Merkl, and A. Rauber, “The growing hierarchical self-organizing map,” in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 6. IEEE, 2000, pp. 15–19.
[26] C. Guarnieri, A. Tanasi, J. Bremer, and M. Schloesser, “The cuckoo sandbox,” 2012.
[27] Y.-H. Li, Y.-R. Tzeng, and F. Yu, “Viso: Characterizing malicious behaviors of virtual machines with unsupervised clustering,” in Cloud Computing Technology and Science (CloudCom), 2015 IEEE 7th International Conference on. IEEE, 2015, pp. 34–41.
[28] S.-W. Lee and F. Yu, “Securing KVM-based cloud systems via virtualization introspection,” in System Sciences (HICSS), 2014 47th Hawaii International Conference on. IEEE, 2014, pp. 5028–5037.
[29] F. Yu, S.-y. Huang, L.-c. Chiou, and R.-h. Tsaih, “Clustering iOS executable using self-organizing maps,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–8.
[30] R.-S. Pirscoveanu, M. Stevanovic, and J. M. Pedersen, “Clustering analysis of malware behavior using self organizing map,” in Cyber Situational Awareness, Data Analytics And Assessment (CyberSA), 2016 International Conference On. IEEE, 2016, pp. 1–6.
[31] S. Marinai, E. Marino, and G. Soda, “Embedded map projection for dimensionality reduction-based similarity search,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 2008, pp. 582–591.
[32] “Virustotal,” https://www.virustotal.com/en/, [Online; accessed 4-April-2018].
[33] M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero, “AVclass: A tool for massive malware labeling,” in International Symposium on Research in Attacks, Intrusions, and Defenses. Springer, 2016, pp. 230–253.
[34] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
[35] W. Hu and Y. Tan, “Black-box attacks against RNN based malware detection algorithms,” arXiv preprint arXiv:1705.08131, 2017.
[36] “strace(1) - Linux man page,” https://linux.die.net/man/1/strace, [Online; accessed 5-April-2018].
[37] S.-W. Hsiao, Y.-N. Chen, Y. S. Sun, and M. C. Chen, “A cooperative botnet profiling and detection in virtualized environment,” in Communications and Network Security (CNS), 2013 IEEE Conference on. IEEE, 2013, pp. 154–162.
[38] “Linux syscall reference,” https://syscalls.kernelgrok.com/, [Online; accessed 11-August-2018].
Description Master's degree
National Chengchi University
Department of Management Information Systems
105356019
Source http://thesis.lib.nccu.edu.tw/record/#G0105356019
Type thesis
dc.contributor.advisor 郁方 [zh_TW]
dc.contributor.advisor Yu, Fang [en_US]
dc.contributor.author (Author) 劉其峰 [zh_TW]
dc.contributor.author (Author) Liu, Chi-Feng [en_US]
dc.creator (Author) 劉其峰 [zh_TW]
dc.creator (Author) Liu, Chi-Feng [en_US]
dc.date (Date) 2018 [en_US]
dc.date.accessioned 3-Sep-2018 15:47:50 (UTC+8)
dc.date.available 3-Sep-2018 15:47:50 (UTC+8)
dc.date.issued (Uploaded) 3-Sep-2018 15:47:50 (UTC+8)
dc.identifier (Other identifier) G0105356019 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/119881
dc.description (Description) Master's degree [zh_TW]
dc.description (Description) National Chengchi University [zh_TW]
dc.description (Description) Department of Management Information Systems [zh_TW]
dc.description (Description) 105356019 [zh_TW]
dc.description.tableofcontents Contents
1 Introduction
2 Related Work
2.1 Neural Network for Malware Classification and Detection
2.2 Recurrent Neural Networks
2.3 Unsupervised Learning Clustering
3 Methodology
3.1 System Overview
3.2 Software Pool
3.3 System Call Frequency Encoding Method
3.4 Unsupervised Learning Clustering
3.5 The Hierarchical SOM Encoding Method
3.5.1 Decimal Encoding Method
3.5.2 Weighted One-hot Encoding Method
3.6 Malware Family Labeling
3.7 Recurrent Neural Network
4 Experiment
4.1 System Architecture
4.2 Experiment Dataset
4.3 Unsupervised Learning Clustering
4.4 Recurrent Neural Network
4.5 Malware Detection Experiment
4.6 Malware Family Classification Experiment
5 Conclusion

List of Figures
1 Long Short-Term Memory Model [1]
2 System overview
3 The structure of GHSOM
4 Decimal encoding method
5 The weighted one-hot encoding method
6 Many-to-one LSTM model with softmax layer
7 System architecture
8 1000-gram interval behavior vectors
9 Testing accuracy of 1000-gram decimal encoding vector
10 Testing accuracy of 1000-gram weighted one-hot encoding vector
11 Testing accuracy of 2000-gram decimal encoding vector
12 Testing accuracy of 2000-gram weighted one-hot encoding vector

List of Tables
1 System calls reference [2]
2 Interval behavior cluster vectors by weighted one-hot encoding method
3 Interval behavior cluster vectors by decimal encoding method
4 Result of malware detection experiment
5 Training & testing accuracy of 1000-gram vectors by decimal encoding method
6 Training & testing accuracy of 1000-gram vectors by weighted one-hot encoding method
7 Training & testing accuracy of 2000-gram vectors by decimal encoding method
8 Training & testing accuracy of 2000-gram vectors by weighted one-hot encoding method
dc.format.extent 846476 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0105356019 [en_US]
dc.subject (Keywords) Recurrent neural network (遞歸神經網路) [zh_TW]
dc.subject (Keywords) Growing hierarchical self-organizing map (增長層級式自我組織映射圖) [zh_TW]
dc.subject (Keywords) Long short-term memory (長短期記憶) [zh_TW]
dc.subject (Keywords) Malware (惡意軟體) [zh_TW]
dc.subject (Keywords) Dynamic analysis (動態分析) [zh_TW]
dc.subject (Keywords) Sequence encoding (序列編碼) [zh_TW]
dc.subject (Keywords) RNN [en_US]
dc.subject (Keywords) GHSOM [en_US]
dc.subject (Keywords) LSTM [en_US]
dc.subject (Keywords) Malware [en_US]
dc.subject (Keywords) Sequence encoding [en_US]
dc.subject (Keywords) Dynamic analysis [en_US]
dc.title (Title) 歸納惡意軟體特徵 [zh_TW]
dc.title (Title) Malware Family Characterization [en_US]
dc.type (Type) thesis [en_US]
dc.identifier.doi (DOI) 10.6814/THE.NCCU.MIS.025.2018.A05