Title: Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network (基於圖神經網路提取惡意程式家族序列特徵)
Author: Chu, Po-Yu (朱柏瑜)
Contributors: Hsiao, Shun-Wen (蕭舜文), advisor; Chu, Po-Yu (朱柏瑜), author
Keywords: Graph neural network; Attention; Sequential data; Markov model
Date: 2022
Uploaded: 1-Aug-2022 17:22:11 (UTC+8)
Abstract:
Malicious software (malware) causes considerable harm to our devices and daily lives, so there is an urgent need to understand how malware behaves and what threats it poses. Most records produced by malware are variable-length, text-based data with timestamps, such as event logs and dynamic analysis profiles. Sorting the records by timestamp turns them into sequential data for subsequent analysis. However, variable-length text-based sequences are difficult to process. In addition, unlike natural-language text, most sequential data in information security have distinctive properties and structures, such as loops, repeated calls, and noise. To analyze application programming interface (API) call sequences and their structure in depth, we represent the sequences as graphs (such as the Markov model), which expose the information and structure hidden in the sequences. We therefore design and implement an Attention Aware Graph Neural Network (AWGCN) to analyze API call sequences. Training AWGCN yields sequence embeddings that can be used to analyze malware behavior. Moreover, in family classification experiments on call-like datasets, AWGCN achieves higher accuracy than the other classifiers, and the sequence embeddings also improve the performance of classic models.

References:
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in International Conference on Machine Learning. PMLR, 2014, pp. 1188–1196.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
C. Beek, T. Dunton, J. Fokker, S. Grobman, T. Hux, T. Polzer, M. Rivero, T. Roccia, J. Saavedra-Morales, R. Samani et al., “McAfee Labs threats report: August 2019,” McAfee Labs, 2019.
“Malware statistics & trends report: AV-TEST.” [Online]. Available: https://www.av-test.org/en/statistics/malware/
S. Alam, R. N. Horspool, I. Traore, and I. Sogukpinar, “A framework for metamorphic malware analysis and real-time detection,” Computers & Security, vol. 48, pp. 212–233, 2015.
M. Akbanov, V. G. Vassilakis, and M. D. Logothetis, “WannaCry ransomware: Analysis of infection, persistence, recovery prevention and propagation mechanisms,” Journal of Telecommunications and Information Technology, 2019.
H. Sinanović and S. Mrdovic, “Analysis of Mirai malicious software,” in 2017 25th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2017, pp. 1–5.
Y. Pan, X. Ge, C. Fang, and Y. Fan, “A systematic literature review of Android malware detection using static analysis,” IEEE Access, vol. 8, pp. 116363–116379, 2020.
M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Computing Surveys (CSUR), vol. 44, no. 2, pp. 1–42, 2008.
R. C. Edgar and S. Batzoglou, “Multiple sequence alignment,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 368–373, 2006.
R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Microsoft malware classification challenge,” arXiv preprint arXiv:1802.10135, 2018.
M. K. Shankarapani, S. Ramamoorthy, R. S. Movva, and S. Mukkamala, “Malware detection using assembly and API call sequences,” Journal in Computer Virology, vol. 7, no. 2, pp. 107–119, 2011.
Y. Ki, E. Kim, and H. K. Kim, “A novel approach to detect malware based on API call sequence analysis,” International Journal of Distributed Sensor Networks, vol. 11, no. 6, p. 659101, 2015.
“BERT (language model),” https://en.wikipedia.org/wiki/BERT_(language_model), accessed Jun. 26, 2022.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
“Transformer (machine learning model),” https://en.wikipedia.org/wiki/Transformer_(machine_learning_model), accessed Jun. 26, 2022.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” 2013. [Online]. Available: https://arxiv.org/abs/1312.6203
Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 4–24, 2020.
T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in Neural Information Processing Systems, vol. 30, 2017.
M. Reddy, API Design for C++. Elsevier, 2011.
J. Wulf and I. Blohm, “Fostering value creation with digital platforms: A unified theory of the application programming interface design,” Journal of Management Information Systems, vol. 37, no. 1, pp. 251–281, 2020.
E. Amer and I. Zelinka, “A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence,” Computers & Security, vol. 92, p. 101760, 2020.
M. Alazab, S. Venkatraman, P. Watters, M. Alazab et al., “Zero-day malware detection based on supervised learning algorithms of API call signatures,” 2010.
“Markov chain,” https://en.wikipedia.org/wiki/Markov_chain, accessed Jun. 26, 2022.
National Center for High-performance Computing (NCHC) and Taiwan Computer Security Incident Response Team (TWCSIRT), “Malware knowledge base,” https://owl.nchc.org.tw/about.php, accessed May 22, 2022.
S.-W. Hsiao and Y.-J. Lee, “NN-based feature selection for text-based sequential data,” 2020.
A. Oliveira, “Malware analysis datasets: API call sequences,” 2019. [Online]. Available: https://dx.doi.org/10.21227/tqqm-aq14
“Cuckoo,” https://cuckoosandbox.org/, accessed Jun. 28, 2022.
“Adware.loadmoney,” https://blog.malwarebytes.com/detections/adware-loadmoney/, accessed Jun. 29, 2022.
“Adware.graftor,” https://blog.malwarebytes.com/detections/adware-graftor/, accessed Jun. 29, 2022.

Description: Master's thesis
National Chengchi University
Department of Management Information Systems
109356020
Source: http://thesis.lib.nccu.edu.tw/record/#G0109356020
Type: thesis
Other identifier: G0109356020
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/141035
Format: application/pdf, 2607954 bytes
DOI: 10.6814/NCCU202200886

Table of contents:
Abstract
Abstract (Chinese)
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
2.1 Text Embedding Algorithm
2.1.1 Bag-of-words and One-hot
2.1.2 Word2Vec
2.1.3 Doc2Vec
2.1.4 RNN
2.1.5 Transformer
2.2 Graph Neural Network
2.3 API Call Sequence
2.4 Markov Model
3 Design of Our Method
3.1 Overview
3.2 Preprocessing Module
3.3 Graph Generation Module
3.4 Graph Convolution Network Module
4 Experiment
4.1 Data Set
4.1.1 SynData
4.1.2 WinMal
4.1.3 Syscall
4.1.4 Oliveira
4.2 Family Classification Comparison
4.3 Attention Mechanism
4.4 Representation of Malware Family
5 Discussion
6 Conclusion
Reference
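
Illustrative note: the abstract describes representing API call sequences as graphs such as the Markov model. The following is a minimal sketch of that general idea, not the thesis's implementation. It builds a first-order transition graph from one call sequence; the function name `transition_graph` and the example trace are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def transition_graph(calls):
    """Build a first-order Markov transition graph from a call sequence.

    Returns a dict {src: {dst: probability}} where each probability is the
    fraction of times `dst` immediately follows `src` in `calls`.
    """
    counts = defaultdict(Counter)
    # Count every consecutive (src, dst) pair in the sequence.
    for src, dst in zip(calls, calls[1:]):
        counts[src][dst] += 1

    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        # Normalize outgoing edge counts into transition probabilities.
        graph[src] = {dst: n / total for dst, n in dsts.items()}
    return graph


# Hypothetical trace (assumption: not taken from any dataset in the thesis).
trace = [
    "NtOpenFile", "NtReadFile", "NtReadFile", "NtClose",
    "NtOpenFile", "NtReadFile", "NtClose",
]
graph = transition_graph(trace)
for src, edges in graph.items():
    print(src, edges)
```

In this sketch a sequence of any length maps onto a graph whose nodes are the distinct calls, and a repeated call becomes a self-loop (here `NtReadFile` to `NtReadFile`), which is one way the loop and repetition structure mentioned in the abstract can be made explicit.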