🤖 AI Summary
This work exposes the severe threat that replay attacks, in which synthetic speech is played through loudspeakers and re-recorded via microphones, pose to deepfake audio detection. Such physical-layer attacks significantly degrade detector performance, causing forged speech to be falsely accepted as genuine. To study this threat systematically, the authors introduce ReplayDF, the first cross-lingual, multi-device, multi-TTS benchmark dataset for replay attacks, constructed from M-AILABS and MLAAD and incorporating realistic acoustic channel distortions. Evaluation of six state-of-the-art detectors, including W2V2-AASIST, reveals that replay attacks raise W2V2-AASIST's equal error rate (EER) from 4.7% to 18.2%. Even after adaptive retraining with room impulse responses (RIRs), the EER remains as high as 11.0%, demonstrating a fundamental vulnerability of current methods in real-world settings. ReplayDF thus establishes a critical benchmark and provides empirical evidence to advance robust audio deepfake detection.
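Conceptually, the replay channel studied here can be approximated in software by convolving synthetic speech with a room impulse response, which imprints the loudspeaker, room, and microphone coloration onto the signal. The sketch below is a minimal illustration of that idea, not the authors' recording pipeline; the file names are placeholders, and the use of soundfile and scipy is an assumption.

```python
# Minimal sketch: simulate a replay-style acoustic channel by convolving
# synthetic speech with a room impulse response (RIR). File names are
# illustrative; any mono WAVs sharing one sample rate will do.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("deepfake_sample.wav")   # synthetic speech (mono)
rir, sr_rir = sf.read("room_ir.wav")          # measured or simulated RIR
assert sr == sr_rir, "resample so both signals share one sample rate"

# Convolution imprints loudspeaker/room/microphone coloration on the audio.
replayed = fftconvolve(speech, rir, mode="full")[: len(speech)]

# Normalize to avoid clipping, then write the "re-recorded" version.
replayed /= np.max(np.abs(replayed)) + 1e-9
sf.write("deepfake_replayed.wav", replayed, sr)
```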
📝 Abstract
We show how replay attacks undermine audio deepfake detection: by playing and re-recording deepfake audio through various speakers and microphones, we make spoofed samples appear authentic to the detection model. To study this phenomenon in more detail, we introduce ReplayDF, a dataset of recordings derived from M-AILABS and MLAAD, featuring 109 speaker-microphone combinations across six languages and four TTS models. It includes diverse acoustic conditions, some highly challenging for detection. Our analysis of six open-source detection models across five datasets reveals significant vulnerability, with the Equal Error Rate (EER) of the top-performing W2V2-AASIST model surging from 4.7% to 18.2%. Even with adaptive Room Impulse Response (RIR) retraining, performance remains compromised, with an EER of 11.0%. We release ReplayDF for non-commercial research use.
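The EER figures quoted above (4.7%, 18.2%, 11.0%) denote the operating point at which a detector's false acceptance and false rejection rates coincide. A minimal sketch of the standard way this metric is computed from raw detector scores, assuming scikit-learn is available, that higher scores indicate bona fide speech, and with toy values in place of real outputs:

```python
# Minimal sketch: compute Equal Error Rate (EER) from detector scores.
# Label convention and score values here are illustrative toy data.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([1, 1, 1, 0, 0, 0])            # 1 = bona fide, 0 = spoof
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])

fpr, tpr, _ = roc_curve(labels, scores)          # false/true positive rates
fnr = 1 - tpr                                    # false negative rate
idx = np.nanargmin(np.abs(fnr - fpr))            # point where FNR ~= FPR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer:.1%}")
```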