Title: AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
Authors: Shahzad, Sahibzada Adil; Hashmi, Ammarah; Peng, Yan-Tsung (彭彥璁); Tsao, Yu; Wang, Hsin-Min
Contributor: Department of Computer Science
Keywords: Audio-visual; audio-visual deepfake detection; deepfake detection; deepfakes; inconsistency; lip sync; multimedia forensics; multimodality; video forgery
Date: 2025-12
Uploaded: 12-Mar-2026 15:07:51 (UTC+8)
Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. Timely detection is crucial to curb the spread of false propaganda and fake news. Tampering with either modality (visual or audio) can only be reliably discovered by multimodal models that exploit both streams of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pretraining for forgery detection. This study proposes a new method based on a multimodal self-supervised learning (SSL) feature extractor that exploits the inconsistency between the audio and visual modalities for multimodal video forgery detection. We use the transformer-based, SSL-pretrained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor, and a multiscale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT extracts visual features only from the lip region, we also adopt another transformer-based video model to exploit whole-face features and capture the spatial and temporal artifacts introduced during deepfake generation. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
Published in: IEEE Transactions on Human-Machine Systems, Vol. 55, No. 6, pp. 973-982
Type: article
DOI: https://doi.org/10.1109/THMS.2025.3618409
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/162045
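The abstract describes a two-branch design: frozen AV-HuBERT features for audio and the lip region feed a multiscale temporal convolutional network, while a second video transformer supplies whole-face features, and the fused result is classified as real or fake. The following is a minimal PyTorch sketch of that wiring, under stated assumptions: the random tensors stand in for AV-HuBERT's per-frame outputs and for the face-level video model (neither pretrained model is loaded here; the real AV-HuBERT lives in the facebookresearch/av_hubert codebase), and every layer size, kernel set, and fusion step is an illustrative guess, not the authors' implementation.

```python
# Hedged sketch of the fusion pipeline described in the abstract.
# All dimensions, kernel sizes, and the fusion scheme are assumptions.

import torch
import torch.nn as nn


class MultiscaleTemporalConv(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes, so the
    classifier can see audio-visual (in)consistency at several
    temporal scales at once."""

    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv1d(in_dim, out_dim, k, padding=k // 2),
                    nn.BatchNorm1d(out_dim),
                    nn.ReLU(),
                )
                for k in kernel_sizes
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, out_dim * n_kernels)
        x = x.transpose(1, 2)  # Conv1d expects (B, C, T)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return y.transpose(1, 2)


class AVForgeryDetector(nn.Module):
    def __init__(self, av_dim=768, face_dim=768, hidden=256):
        super().__init__()
        # Lip/audio branch: multiscale temporal conv over concatenated
        # AV-HuBERT audio and lip-region visual features.
        self.temporal = MultiscaleTemporalConv(2 * av_dim, hidden)
        # Whole-face branch: stand-in projection for the second
        # transformer-based video model mentioned in the abstract.
        self.face_proj = nn.Linear(face_dim, hidden)
        # 3 conv branches * hidden, plus the face branch -> real/fake logits.
        self.classifier = nn.Linear(3 * hidden + hidden, 2)

    def forward(self, audio_feats, lip_feats, face_feats):
        # audio_feats, lip_feats: (B, T, av_dim) from the SSL extractor
        # face_feats:             (B, T, face_dim) from the video model
        av = torch.cat([audio_feats, lip_feats], dim=-1)
        av = self.temporal(av).mean(dim=1)            # pool over time
        face = self.face_proj(face_feats).mean(dim=1)
        return self.classifier(torch.cat([av, face], dim=-1))


if __name__ == "__main__":
    B, T = 2, 50  # 2 clips, 50 frames each
    model = AVForgeryDetector()
    logits = model(
        torch.randn(B, T, 768),  # placeholder audio features
        torch.randn(B, T, 768),  # placeholder lip-region features
        torch.randn(B, T, 768),  # placeholder whole-face features
    )
    print(logits.shape)  # torch.Size([2, 2])
```

Parallel kernels of sizes 3/5/7 are one common way to realize a "multiscale" temporal convolution; the paper's actual scales, pooling, and fusion strategy may differ.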