Title 基於圖神經網路提取惡意程式家族序列特徵
Sequence Feature Extraction for Malware Family Analysis via Graph Neural Network
Author 朱柏瑜
Chu, Po-Yu
Contributors 蕭舜文
Hsiao, Shun-Wen
朱柏瑜
Chu, Po-Yu
Keywords Graph neural network
Attention
Sequential data
Markov model
Date 2022
Upload time 1-Aug-2022 17:22:11 (UTC+8)
Abstract Malicious software (malware) causes considerable harm to our devices and daily lives, so we are eager to understand how malware behaves and the threats it poses. Most records produced by malware, such as event logs and dynamic analysis profiles, are variable-length, text-based data with timestamps. Using the timestamps, we can sort these records into sequences for subsequent analysis. However, variable-length, text-based sequences are difficult to process. In addition, unlike natural language text, most sequential data in information security has distinctive properties and structures, such as loops, repeated calls, and noise. To analyze API call sequences together with their structure, we represent the sequences as graphs (such as a Markov model), which lets us investigate the information and structure hidden in them. We therefore design and implement an Attention Aware Graph Neural Network (AWGCN) to analyze API call sequences. By training AWGCN, we obtain sequence embeddings that can be used to analyze malware behavior. Moreover, in family classification experiments on the call-like datasets, AWGCN achieves higher accuracy than the other classifiers, and the sequence embeddings also improve the performance of classic models.
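The graph representation mentioned in the abstract can be illustrated with a minimal sketch: timestamped records are sorted into an API call sequence, and consecutive calls are counted to form a first-order Markov transition graph. This is only an illustration under stated assumptions, not the thesis's AWGCN implementation; the function names `to_sequence` and `transition_graph` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def to_sequence(records):
    """Sort (timestamp, api_name) records by timestamp and return the API names in order."""
    return [api for _, api in sorted(records, key=lambda r: r[0])]


def transition_graph(sequence):
    """Build a first-order Markov transition graph: {source: {target: probability}}."""
    counts = defaultdict(Counter)
    # Count each transition between consecutive calls in the sequence.
    for src, dst in zip(sequence, sequence[1:]):
        counts[src][dst] += 1
    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        # Normalize outgoing counts so each node's edge weights sum to 1.
        graph[src] = {dst: n / total for dst, n in dsts.items()}
    return graph


# Hypothetical log records (timestamp, API name), deliberately out of order.
records = [
    (3, "ReadFile"),
    (1, "CreateFile"),
    (2, "ReadFile"),
    (4, "WriteFile"),
    (5, "ReadFile"),
    (6, "CloseHandle"),
]
sequence = to_sequence(records)
graph = transition_graph(sequence)
```

A repeated call such as `ReadFile` followed by `ReadFile` appears as a self-loop, and a loop in the sequence appears as a cycle between nodes, which is the kind of structure a flat token sequence does not make explicit.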
Description Master's
National Chengchi University
Department of Management Information Systems
109356020
Source http://thesis.lib.nccu.edu.tw/record/#G0109356020
Type thesis
dc.contributor.advisor Hsiao, Shun-Wen (蕭舜文)
dc.date.accessioned 1-Aug-2022 17:22:11 (UTC+8)
dc.date.available 1-Aug-2022 17:22:11 (UTC+8)
dc.date.issued (Upload time) 1-Aug-2022 17:22:11 (UTC+8)
dc.identifier (Other Identifiers) G0109356020
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141035
dc.description.tableofcontents Abstract
Abstract (Chinese)
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
2.1 Text Embedding Algorithm
2.1.1 Bag-of-words and One-hot
2.1.2 Word2Vec
2.1.3 Doc2Vec
2.1.4 RNN
2.1.5 Transformer
2.2 Graph Neural Network
2.3 API Call Sequence
2.4 Markov Model
3 Design of Our Method
3.1 Overview
3.2 Preprocessing Module
3.3 Graph Generation Module
3.4 Graph Convolution Network Module
4 Experiment
4.1 Data Set
4.1.1 SynData
4.1.2 WinMal
4.1.3 Syscall
4.1.4 Oliveira
4.2 Family Classification Comparison
4.3 Attention Mechanism
4.4 Representation of Malware Family
5 Discussion
6 Conclusion
Reference
dc.format.extent 2607954 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202200886