
Title CapST: Leveraging Capsule Networks and Temporal Attention for Accurate Model Attribution in Deep-fake Videos
Author(s) 汪新
Ahmad, Wasim;Peng, Yan Tsung;Chang, Yuan-Hao;Ganfure, Gaddisa Olani;Khan, Sarwar
Contributor 群智博五
Date 2025-04
Upload time 27-May-2025 11:09:35 (UTC+8)
Abstract Deep-fake videos, generated through AI face-swapping techniques, have garnered considerable attention due to their potential for impactful impersonation attacks. While existing research primarily distinguishes real from fake videos, attributing a deep-fake to its specific generation model or encoder is crucial for forensic investigation, enabling precise source tracing and tailored countermeasures. This approach not only enhances detection accuracy by leveraging unique model-specific artifacts but also provides insights essential for developing proactive defenses against evolving deep-fake techniques. Addressing this gap, this article investigates the model attribution problem for deep-fake videos using two datasets: Deepfakes from Different Models (DFDM) and GANGen-Detection, which comprise deep-fake videos and images generated by GAN models. We select only the fake images from the GANGen-Detection dataset to align it with the DFDM dataset, consistent with the goal of this study: model attribution rather than real/fake classification. This study formulates deep-fake model attribution as a multiclass classification task, introducing a novel Capsule-Spatial-Temporal (CapST) model that integrates a modified VGG19 (utilizing only the first 26 of its 52 layers) for feature extraction with Capsule Networks and a spatio-temporal attention mechanism. The Capsule module captures intricate feature hierarchies, enabling robust identification of deep-fake attributes, while a video-level fusion technique leverages temporal attention to process concatenated feature vectors and capture temporal dependencies in deep-fake videos. By aggregating insights across frames, our model achieves a comprehensive understanding of video content, resulting in more precise predictions.
Experimental results on the DFDM and GANGen-Detection datasets demonstrate the efficacy of CapST, achieving substantial improvements in accurately categorizing deep-fake videos over baseline models, all while demanding fewer computational resources.
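The video-level fusion step described in the abstract, in which temporal attention aggregates per-frame feature vectors into one video-level representation, can be sketched as follows. This is a minimal NumPy illustration of attention-weighted fusion, not the authors' implementation; the feature dimension, frame count, scoring vector, and function names are assumptions for demonstration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_fusion(frame_features, w):
    """Fuse per-frame features into a single video-level vector.

    frame_features: (T, D) array, one D-dim feature vector per frame
    w: (D,) scoring vector (learned in practice; random here)
    """
    scores = frame_features @ w      # (T,) relevance score per frame
    alphas = softmax(scores)         # attention weights, summing to 1
    return alphas @ frame_features   # (D,) attention-weighted average

rng = np.random.default_rng(0)
T, D = 8, 16                         # e.g. 8 sampled frames, 16-dim features
feats = rng.normal(size=(T, D))
w = rng.normal(size=D)
video_vec = temporal_attention_fusion(feats, w)
print(video_vec.shape)               # (16,)
```

In the full model, the per-frame features would come from the truncated VGG19 plus capsule stages, and the fused vector would feed a multiclass classifier over the candidate generation models.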
Relation ACM Transactions on Multimedia Computing, Communications and Applications, Vol.21, No.4, pp.1-23
Type article
DOI https://doi.org/10.1145/3715138
URI https://nccur.lib.nccu.edu.tw/handle/140.119/157107