ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient audio-visual feature modeling and weak cross-modal inconsistency detection in multimodal deepfake content detection, this paper proposes a dual-stream network integrating enhanced receptive fields with cross-modal attention. Methodologically, we design a multi-scale dilated convolution module to extend temporal modeling capacity, introduce a cross-modal cross-attention mechanism for fine-grained audio-video feature alignment and inconsistency modeling, and adopt a lightweight architecture to ensure real-time inference. Evaluated on the DDL-AV benchmark dataset, our approach achieves state-of-the-art performance with 98.7% accuracy and a throughput of 32 FPS—significantly outperforming existing unimodal and early-fusion methods—and secured first place in an international deepfake detection competition.
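The paper does not include code, but the multi-scale dilated convolution idea can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation; the kernel values, dilation rates, and 1-D feature sequence are illustrative assumptions. The point is that a kernel of length k with dilation d spans (k - 1) * d + 1 time steps, so parallel branches with growing dilations widen the temporal receptive field without adding parameters:

```python
def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with zero padding ('same' output length).

    A kernel of length k with dilation d covers a temporal span of
    (k - 1) * d + 1 samples, which is how stacked dilations enlarge
    the effective receptive field over the feature sequence.
    """
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad  # zero-pad both ends
    return [sum(kernel[i] * xp[t + i * dilation] for i in range(k))
            for t in range(len(x))]

def multi_scale_block(x, kernel, dilations=(1, 2, 4)):
    """Run parallel dilated branches over the same feature sequence;
    each branch sees a progressively wider temporal context."""
    return [dilated_conv1d(x, kernel, d) for d in dilations]
```

With an odd-length kernel (k = 3 here), each branch keeps the sequence length, so the branch outputs can be concatenated or summed downstream.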

📝 Abstract
Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can span multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines an enhanced receptive field (ERF) with audio-visual fusion. Our model processes audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which contains both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in both accuracy and processing speed. ERF-BA-TFD+ demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), winning first place in the competition.
Problem

Research questions and friction points this paper is trying to address.

Detecting multimodal deepfake content in audio-visual data
Improving detection accuracy and robustness across modalities
Modeling long-range dependencies to identify subtle fake discrepancies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced receptive field for multimodal deepfake detection
Audio-visual fusion to capture subtle content discrepancies
Models long-range dependencies in audio-visual inputs
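The cross-modal attention named above can be sketched as standard scaled dot-product attention where queries come from one modality and keys/values from the other, e.g. audio frames attending over video frames. This is a toy pure-Python illustration under assumed shapes, not the paper's actual fusion module:

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention across modalities: queries from one
    stream (e.g. audio frames) attend over keys/values from the other
    (e.g. video frames), aligning the streams so that audio-visual
    inconsistencies show up in the attended features."""
    d = len(queries[0])
    out = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # weighted sum of value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

Each output row is a video-conditioned summary for one audio frame; a symmetric pass with the roles swapped would give video-conditioned-on-audio features, and the two can be fused for classification.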
Xin Zhang
Lanzhou University
Jiaming Chu
Beijing University of Posts and Telecommunications
Jian Zhao
TeleAI of China Telecom
Yuchu Jiang
Southeast University
Xu Yang
Southeast University
Lei Jin
Beijing University of Posts and Telecommunications
Chi Zhang
TeleAI of China Telecom
Xuelong Li
TeleAI of China Telecom