Title: 基於惡意軟體呼叫序列解耦潛在表示的多模態分類方法
(Leveraging Disentangled Latent Representations of Malware Call Sequences for Multimodal Classification)
Author: Hsieh, Tung-Jui (謝東睿)
Advisor: Hsiao, Shun-Wen (蕭舜文)
Keywords: Multimodal learning; Disentangled representation; Contrastive learning; Malware analysis; Interpretability
Date: 2025
Uploaded: 1-Sep-2025 15:06:00 (UTC+8)

Abstract:
Complex data analysis often employs multimodal representation strategies, combining different data modalities or transforming single data sources into complementary perspectives to capture diverse behavioral aspects. However, understanding and interpreting the shared versus unique information across these modalities represents a fundamental challenge that directly impacts model interpretability, robustness, and generalization.

To address this challenge, we propose DREAM (Disentangled Reconstructive Embedding for Adversarial Multimodal learning), an interpretable framework for disentangled representation learning that decomposes each modality into common and private subspaces. The framework's core innovation lies in employing instance-level alignment through contrastive learning, which we demonstrate is crucial for learning meaningful common representations that capture genuine cross-modal semantic consistency rather than mere distributional similarities.

DREAM provides effective downstream task performance while offering a systematic methodology to dissect and visualize learned latent structures through attribution analysis and representation clustering. To validate our framework's effectiveness and interpretability, we present a comprehensive case study in malware analysis. The framework is designed for general multimodal applications, but this case study focuses on a multi-view learning scenario where we transform single-source API call sequences into sequential, structural, and visual representations.

Our experimental results on malware family classification demonstrate that DREAM achieves competitive performance (84.3% accuracy) while providing valuable interpretability insights. The framework successfully separates common semantic content from modality-specific characteristics, with attribution analysis revealing that discriminative information for family classification primarily resides in private subspaces.
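As an illustrative aside, the instance-level alignment via contrastive learning that the abstract identifies as the framework's core innovation can be sketched with a symmetric InfoNCE-style objective. This is a minimal sketch under our own assumptions, not code from the thesis: the function names, temperature value, and toy embeddings are all illustrative.

```python
# Illustrative sketch (not the thesis implementation): instance-level
# alignment via an InfoNCE-style contrastive loss between the common-space
# embeddings of two modalities. z_a[i] and z_b[i] come from the same sample
# (positive pair); every other pairing in the batch acts as a negative.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def info_nce(z_a, z_b, temperature=0.1):
    """Average cross-entropy of matching each z_a[i] to its partner z_b[i]
    against all other z_b[j] in the batch."""
    n = len(z_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax at the positive index
    return loss / n

# Toy batch: correctly paired embeddings yield a much lower loss than
# mismatched ones, which is what drives the common spaces into agreement
# at the level of individual instances rather than whole distributions.
text_common  = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_common = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
aligned = info_nce(text_common, image_common)
shuffled = info_nce(text_common, image_common[1:] + image_common[:1])
```

The contrast between `aligned` and `shuffled` illustrates why instance-level alignment captures per-sample semantic correspondence: a distribution-level criterion would score both batches similarly, since shuffling leaves the marginal distribution unchanged.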
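The multi-view transformation described in the abstract turns one API call sequence into complementary views. As a hedged sketch (the helper names, window size, and toy call sequence are assumptions, not the thesis's code), two of those views can be derived roughly as follows: a Markov-chain transition matrix as the structural view and a windowed co-occurrence matrix as the visual view, which could then be rendered as an image.

```python
# Illustrative sketch (assumed, not the thesis code): deriving a structural
# view (first-order Markov transition matrix) and a visual view (windowed
# co-occurrence matrix) from a single API call sequence.

def transition_matrix(seq, vocab):
    """Row-normalized first-order transition probabilities between APIs."""
    idx = {api: i for i, api in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in vocab]
    for a, b in zip(seq, seq[1:]):          # consecutive call pairs
        counts[idx[a]][idx[b]] += 1.0
    for row in counts:
        total = sum(row)
        if total:
            for j in range(len(row)):
                row[j] /= total             # normalize each row to sum to 1
    return counts

def cooccurrence_matrix(seq, vocab, window=2):
    """Symmetric counts of APIs appearing within `window` positions."""
    idx = {api: i for i, api in enumerate(vocab)}
    mat = [[0] * len(vocab) for _ in vocab]
    for i, a in enumerate(seq):
        for b in seq[i + 1 : i + 1 + window]:
            mat[idx[a]][idx[b]] += 1
            mat[idx[b]][idx[a]] += 1
    return mat

# Toy sequence with hypothetical Windows API names.
calls = ["CreateFile", "WriteFile", "CreateFile", "RegSetValue"]
vocab = ["CreateFile", "WriteFile", "RegSetValue"]
T = transition_matrix(calls, vocab)   # structural view
C = cooccurrence_matrix(calls, vocab) # visual view (renderable as an image)
```

The sequential view would simply be the token sequence itself; the point of the sketch is that all three views originate from the same single-source sequence, so any information shared across them is genuinely common.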
Furthermore, our approach enables the discovery of previously unknown relationships between malware families by identifying samples with high common-space similarity across different families.

The primary contributions are: (1) a general interpretable disentangled representation learning framework for multimodal data, (2) theoretical and empirical demonstration of instance-level alignment's superiority over distribution-level methods, (3) a systematic interpretability analysis methodology for multimodal representations, and (4) comprehensive insights for complex behavioral sequence analysis through our challenging case study.

Description: Master's thesis
National Chengchi University (國立政治大學)
Department of Management Information Systems (資訊管理學系)
Student ID: 112356045
Source: http://thesis.lib.nccu.edu.tw/record/#G0112356045
Type: thesis
Identifier: G0112356045
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159099
Format: 5414641 bytes, application/pdf

Table of Contents:
摘要 (Chinese abstract) i
Abstract iii
Contents v
List of Figures ix
List of Tables xii
1 Introduction 1
2 Proposed Method 5
  2.1 Overview 5
  2.2 Problem Formulation 6
  2.3 Multimodal Data Representation 7
    2.3.1 Markov Chain Transition Graph Construction 7
    2.3.2 Co-occurrence Matrix Image Generation 8
    2.3.3 Sequential Text Representation 8
3 DREAM Architecture and Algorithm 10
  3.1 Feature Extraction Architecture 10
    3.1.1 Text Modality Processing 10
    3.1.2 Hybrid Graph Modality Processing 11
    3.1.3 Vision Transformer for Image Modality Processing 12
  3.2 Disentangled Representation Learning: Objectives and Architecture 14
    3.2.1 Common and Private Encoders 14
    3.2.2 Adversarial Disentanglement 15
    3.2.3 Instance-level Alignment 16
    3.2.4 Structural Constraints: Disparity and Reconstruction 17
4 Analysis and Discussion 19
  4.1 Performance Benchmarking and Ablation Studies 19
  4.2 Interpretability through Attribution Analysis 20
    4.2.1 Dissecting Common and Private Representations 21
    4.2.2 Validating Family-level Semantics in Common Space 21
    4.2.3 Explaining Cross-Family Similarities 22
  4.3 Discussion 22
    4.3.1 Insights from Disentangled Representations 23
    4.3.2 Methodological Implications for Multi-view Learning 23
    4.3.3 Limitations and Future Work 24
    4.3.4 Attribution Analysis Figures 25
5 Implementation 27
  5.1 Downstream Task Application: Fusion and Classification 27
    5.1.1 Cross-Modal Attention Fusion 27
    5.1.2 Final Classification 29
  5.2 Overall Objective and Implementation Details 30
    5.2.1 Overall Training Objective 30
    5.2.2 Implementation Details 30
6 Experiments 33
  6.1 Experimental Setup 33
    6.1.1 Dataset 33
    6.1.2 Data Preprocessing and Configuration 33
    6.1.3 Experimental Configuration 34
  6.2 Model Training and Performance Validation 35
    6.2.1 Training Dynamics and Convergence 35
    6.2.2 Overall Downstream Task Performance 36
    6.2.3 Per-Family Performance Analysis 36
  6.3 Analysis of Disentangled Representation Structure 37
    6.3.1 Qualitative Analysis of Latent Spaces 37
    6.3.2 Component Effectiveness and Information Gain 39
    6.3.3 Efficacy of the Disentanglement Process 40
7 Related Work 43
  7.1 API Call Sequence Analysis 43
  7.2 Disentangled Representation Learning 44
  7.3 Representation Alignment in Multi-view Learning 45
    7.3.1 Distribution-level Alignment Methods 45
    7.3.2 Instance-level Alignment Methods 46
  7.4 Adversarial Learning and Domain Adaptation 48
  7.5 Multimodal Fusion Techniques 49
    7.5.1 Attention-Based Fusion Methods 49
    7.5.2 Advanced Fusion Architectures 50
8 Conclusion 51
A Dataset and Experimental Setup 53
  A.1 WinMal Dataset Overview 53
    A.1.1 Data Preprocessing and Configuration 54
    A.1.2 Experimental Configuration 54
  A.2 Model Training and Performance Validation 54
    A.2.1 Training Dynamics and Convergence 54
    A.2.2 Overall Downstream Task Performance 55
    A.2.3 Evaluation Metrics 55
  A.3 Baseline Methods and Comparison 56
  A.4 Implementation Details 57
Reference 58
