Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network (基於圖神經網路提取惡意程式家族序列特徵)


Title	Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network (基於圖神經網路提取惡意程式家族序列特徵)
Author	Chu, Po-Yu (朱柏瑜)
Contributors	Hsiao, Shun-Wen (蕭舜文), advisor; Chu, Po-Yu (朱柏瑜), author
Keywords	Graph neural network; Attention; Sequential data; Markov model (圖神經網路; 注意力機制; 序列型資料; 馬可夫模型)
Date	2022
Upload time	1-Aug-2022 17:22:11 (UTC+8)
Abstract	Malicious software (malware) causes considerable harm to our devices and daily lives, so there is a pressing need to understand how malware behaves and what threats it poses. Most records produced by malware are variable-length, text-based files with timestamps, such as event logs and dynamic analysis profiles. Using the timestamps, we can sort these records into sequences for subsequent analysis. However, variable-length text sequences are difficult to process. Moreover, unlike natural-language text, most sequential data in information security has distinctive properties and structures, such as loops, repeated calls, and noise. To analyze application programming interface (API) call sequences together with their structure, we represent the sequences as graphs, such as a Markov model, which expose the information and structure hidden in the sequences. We therefore design and implement an Attention Aware Graph Neural Network (AWGCN) to analyze API call sequences. Training AWGCN yields sequence embeddings that can be used to analyze malware behavior. In family classification experiments on call-like datasets, AWGCN achieves higher accuracy than the other classifiers, and the sequence embeddings also improve the performance of classic models.
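The graph representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a first-order Markov chain over API names; the log records, timestamps, API names, and the helper `build_markov_graph` are all hypothetical and chosen only for illustration. This is not the thesis's AWGCN implementation, which this record does not describe in detail.

```python
from collections import Counter, defaultdict


def build_markov_graph(calls):
    """Return {src: {dst: probability}} estimated from consecutive call pairs.

    Illustrative helper (not from the thesis): each distinct API name is a
    node, and each edge weight is the empirical first-order transition
    probability from one call to the next.
    """
    counts = defaultdict(Counter)
    for src, dst in zip(calls, calls[1:]):
        counts[src][dst] += 1

    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        graph[src] = {dst: n / total for dst, n in dsts.items()}
    return graph


# Hypothetical timestamped log records (timestamp, API name), out of order.
log = [
    (3, "ReadFile"),
    (1, "CreateFile"),
    (2, "ReadFile"),
    (4, "WriteFile"),
    (5, "ReadFile"),
    (6, "CloseHandle"),
]

# Sort by timestamp to obtain the call sequence, then build the graph.
sequence = [name for _, name in sorted(log, key=lambda record: record[0])]
graph = build_markov_graph(sequence)

for src, dsts in graph.items():
    for dst, prob in dsts.items():
        print(f"{src} -> {dst}: {prob:.2f}")
```

In this toy sequence the repeated `ReadFile` call becomes a self-loop, so loops and repeated calls collapse into a fixed-size graph regardless of sequence length. A graph neural network could then take such a weighted graph as input in place of the raw variable-length sequence.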
Description	Master's thesis, National Chengchi University, Department of Management Information Systems, 109356020
Source	http://thesis.lib.nccu.edu.tw/record/#G0109356020
Type	thesis

dc.contributor.advisor	蕭舜文	zh_TW
dc.contributor.advisor	Hsiao, Shun-Wen	en_US
dc.contributor.author (Authors)	朱柏瑜	zh_TW
dc.contributor.author (Authors)	Chu, Po-Yu	en_US
dc.creator (Author)	朱柏瑜	zh_TW
dc.creator (Author)	Chu, Po-Yu	en_US
dc.date (Date)	2022	en_US
dc.date.accessioned	1-Aug-2022 17:22:11 (UTC+8)	-
dc.date.available	1-Aug-2022 17:22:11 (UTC+8)	-
dc.date.issued (Upload time)	1-Aug-2022 17:22:11 (UTC+8)	-
dc.identifier (Other Identifiers)	G0109356020	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/141035	-
dc.description (Description)	碩士	zh_TW
dc.description (Description)	國立政治大學	zh_TW
dc.description (Description)	資訊管理學系	zh_TW
dc.description (Description)	109356020	zh_TW
dc.description.tableofcontents	Abstract; Abstract in Chinese (摘要); Contents; List of Figures; List of Tables; 1 Introduction; 2 Related Work; 2.1 Text Embedding Algorithm; 2.1.1 Bag-of-words and One-hot; 2.1.2 Word2Vec; 2.1.3 Doc2Vec; 2.1.4 RNN; 2.1.5 Transformer; 2.2 Graph Neural Network; 2.3 API Call Sequence; 2.4 Markov Model; 3 Design of Our Method; 3.1 Overview; 3.2 Preprocessing Module; 3.3 Graph Generation Module; 3.4 Graph Convolution Network Module; 4 Experiment; 4.1 Data Set; 4.1.1 SynData; 4.1.2 WinMal; 4.1.3 Syscall; 4.1.4 Oliveira; 4.2 Family Classification Comparison; 4.3 Attention Mechanism; 4.4 Representation of Malware Family; 5 Discussion; 6 Conclusion; Reference	zh_TW
dc.format.extent	2607954 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0109356020	en_US
dc.subject (Keywords)	圖神經網路	zh_TW
dc.subject (Keywords)	注意力機制	zh_TW
dc.subject (Keywords)	序列型資料	zh_TW
dc.subject (Keywords)	馬可夫模型	zh_TW
dc.subject (Keywords)	Graph neural network	en_US
dc.subject (Keywords)	Attention	en_US
dc.subject (Keywords)	Sequential data	en_US
dc.subject (Keywords)	Markov model	en_US
dc.title (Title)	基於圖神經網路提取惡意程式家族序列特徵	zh_TW
dc.title (Title)	Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202200886	en_US
