Title: AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
Authors: Shahzad, Sahibzada Adil; Hashmi, Ammarah; Peng, Yan-Tsung (彭彥璁); Tsao, Yu; Wang, Hsin-Min
Contributor: Department of Computer Science
Keywords: Audio-visual; audio-visual deepfake detection; deepfake detection; deepfakes; inconsistency; lip sync; multimedia forensics; multimodality; video forgery
Date: 2025-12
Uploaded: 12-Mar-2026 15:07:51 (UTC+8)
Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. Timely detection is crucial to curb the spread of false propaganda and fake news. Tampering with either modality (visual or audio) can only be reliably discovered by multimodal models that exploit both streams of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pretraining for forgery detection. This study proposes a new method based on a multimodal self-supervised learning (SSL) feature extractor that exploits the inconsistency between the audio and visual modalities for multimodal video forgery detection. We use the transformer-based, SSL-pretrained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor, and a multiscale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT extracts visual features only from the lip region, we also adopt another transformer-based video model to exploit whole-face features and capture the spatial and temporal artifacts introduced during deepfake generation. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
Published in: IEEE Transactions on Human-Machine Systems, Vol. 55, No. 6, pp. 973-982
Type: article
DOI: https://doi.org/10.1109/THMS.2025.3618409
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/162045
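The abstract describes a two-branch design: frozen AV-HuBERT features for audio and the lip region feed a multiscale temporal convolutional network, while a second video transformer supplies whole-face features, and the fused result is classified as real or fake. The following is a minimal PyTorch sketch of that wiring, under stated assumptions: the random tensors stand in for AV-HuBERT's per-frame outputs and for the face-level video model (neither pretrained model is loaded here; the real AV-HuBERT lives in the facebookresearch/av_hubert codebase), and every layer size, kernel set, and fusion step is an illustrative guess, not the authors' implementation.

```python
# Hedged sketch of the fusion pipeline described in the abstract.
# All dimensions, kernel sizes, and the fusion scheme are assumptions.

import torch
import torch.nn as nn


class MultiscaleTemporalConv(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes, so the
    classifier can see audio-visual (in)consistency at several
    temporal scales at once."""

    def __init__(self, in_dim: int, out_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv1d(in_dim, out_dim, k, padding=k // 2),
                    nn.BatchNorm1d(out_dim),
                    nn.ReLU(),
                )
                for k in kernel_sizes
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, out_dim * n_kernels)
        x = x.transpose(1, 2)  # Conv1d expects (B, C, T)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return y.transpose(1, 2)


class AVForgeryDetector(nn.Module):
    def __init__(self, av_dim=768, face_dim=768, hidden=256):
        super().__init__()
        # Lip/audio branch: multiscale temporal conv over concatenated
        # AV-HuBERT audio and lip-region visual features.
        self.temporal = MultiscaleTemporalConv(2 * av_dim, hidden)
        # Whole-face branch: stand-in projection for the second
        # transformer-based video model mentioned in the abstract.
        self.face_proj = nn.Linear(face_dim, hidden)
        # 3 conv branches * hidden, plus the face branch -> real/fake logits.
        self.classifier = nn.Linear(3 * hidden + hidden, 2)

    def forward(self, audio_feats, lip_feats, face_feats):
        # audio_feats, lip_feats: (B, T, av_dim) from the SSL extractor
        # face_feats:             (B, T, face_dim) from the video model
        av = torch.cat([audio_feats, lip_feats], dim=-1)
        av = self.temporal(av).mean(dim=1)            # pool over time
        face = self.face_proj(face_feats).mean(dim=1)
        return self.classifier(torch.cat([av, face], dim=-1))


if __name__ == "__main__":
    B, T = 2, 50  # 2 clips, 50 frames each
    model = AVForgeryDetector()
    logits = model(
        torch.randn(B, T, 768),  # placeholder audio features
        torch.randn(B, T, 768),  # placeholder lip-region features
        torch.randn(B, T, 768),  # placeholder whole-face features
    )
    print(logits.shape)  # torch.Size([2, 2])
```

Parallel kernels of sizes 3/5/7 are one common way to realize a "multiscale" temporal convolution; the paper's actual scales, pooling, and fusion strategy may differ.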