Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

📅 2025-12-14

🤖 AI Summary

Problem: Audio deepfake detection models lose robustness under realistic acoustic conditions such as reverberation, environmental noise, and consumer-grade recording channel distortions. Method: This paper surveys robustness in current detectors and presents a reproducible benchmark that evaluates them under controlled signal-to-noise ratios (SNRs). It pairs multi-condition training with fixed-SNR evaluation and defines, alongside binary detection, a four-class task that crosses authenticity (genuine vs. spoofed) with corruption. Experiments mix MS-SNSD noise with ASVspoof 2021 DF utterances and fine-tune WavLM, Wav2Vec2, and MMS encoders for binary and four-class classification, evaluated via accuracy, ROC-AUC, and EER. Results: Fine-tuning reduces EER by 10 to 15 percentage points at low SNRs (10 to 0 dB) across backbones. Sweeping SNR from 35 dB down to -5 dB quantifies how performance degrades as noise increases, which informs reliability estimates for real-world deployment.
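Since the summary reports ROC-AUC and EER, a minimal NumPy sketch of both metrics may help make them concrete. It is an illustration, not the paper's evaluation code: the function names are hypothetical, and it assumes label 1 means spoofed and that a higher score means "more likely spoofed".

```python
import numpy as np

def compute_eer(labels, scores):
    """Equal error rate: the operating point where FPR and FNR are (nearly) equal.

    Assumption: labels are 1 for spoofed and 0 for genuine, and a clip is
    flagged as spoofed when its score is >= the threshold.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the false positive rate can reach zero.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # genuine flagged as spoofed
    fnr = np.array([(pos < t).mean() for t in thresholds])   # spoofed clips missed
    idx = np.argmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

def compute_roc_auc(labels, scores):
    """ROC-AUC as the probability that a spoofed clip outscores a genuine one."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    # Ties between a positive and a negative score count as half a win.
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())
```

The pairwise AUC is quadratic in the number of clips, which is fine for a sketch; a library routine would be the usual choice for full evaluation sets.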

📝 Abstract

Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions, including background noise (domestic/office/transport), room reverberation, and consumer channels, often lags clean-lab results. We survey and evaluate robustness for state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity that is widely used in speech processing; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) to quantify how gracefully performance degrades. We study multi-condition training and fixed-SNR testing for pre-trained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity × corruption) tasks. In our experiments, fine-tuning reduces EER by 10 to 15 percentage points at 10 to 0 dB SNR across backbones.
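Mixing a noise clip into an utterance at a controlled SNR can be sketched as follows. This is a minimal illustration under the standard power-ratio definition, SNR_dB = 10·log10(P_speech / P_noise), and not the paper's pipeline. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and looping a short noise clip to cover the utterance is an assumption.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Loop the noise clip if it is shorter than the utterance, then trim it.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]
    # Mean power of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0) + eps))
    return speech + gain * noise

def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

Sweeping `snr_db` over a grid between 35 and -5 dB would produce the fixed-SNR test conditions described in the abstract; the intermediate grid points are not stated there and would need to be chosen.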

Problem

Research questions and friction points this paper is trying to address.

Evaluates deepfake audio detection robustness in noisy conditions

Proposes a reproducible SNR-based framework for controlled testing

Investigates multi-condition training to mitigate performance degradation under noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled SNR framework for realistic noise evaluation

Multi-condition training improves detection in noisy environments

Fine-tuning reduces EER by 10 to 15 percentage points at 10 to 0 dB SNR across backbones
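One way to picture multi-condition training together with the four-class task is a sampler that leaves some utterances clean and mixes the rest with noise at a randomly drawn SNR. The sketch below is an assumption-laden illustration, not the authors' recipe: the SNR grid, the `p_clean` probability, the label encoding, and all function names are hypothetical, and "corruption" is assumed to be a binary clean/noisy flag.

```python
import numpy as np

# Hypothetical training grid; the abstract only gives the sweep endpoints
# (35 dB and -5 dB), so the intermediate values are placeholders.
TRAIN_SNRS_DB = (35, 20, 10, 0, -5)

def add_noise(speech, noise, snr_db):
    """Scale noise so speech power / noise power matches snr_db, then add it."""
    noise = np.resize(noise, speech.shape)  # repeat or trim to the speech length
    gain = np.sqrt(np.mean(speech ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * noise

def four_class_label(is_spoof, is_noisy):
    """Assumed encoding: 0 genuine/clean, 1 genuine/noisy, 2 spoof/clean, 3 spoof/noisy."""
    return 2 * int(is_spoof) + int(is_noisy)

def multi_condition_example(speech, is_spoof, noise_bank, rng, p_clean=0.5):
    """Return (waveform, binary_label, four_class_label) for one training item."""
    if rng.random() < p_clean:
        # Keep the utterance clean.
        return speech, int(is_spoof), four_class_label(is_spoof, False)
    # Otherwise pick a random noise clip and a random SNR from the grid.
    noise = noise_bank[rng.integers(len(noise_bank))]
    snr_db = rng.choice(TRAIN_SNRS_DB)
    noisy = add_noise(speech, noise, snr_db)
    return noisy, int(is_spoof), four_class_label(is_spoof, True)
```

Fixed-SNR testing would reuse `add_noise` with a single `snr_db` for every test utterance, so that each reported metric corresponds to one noise level.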

🔎 Similar Papers

Audio Anti-Spoofing Detection: A Survey

2024-04-22, arXiv.org, Citations: 25

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection

2024-09-23, arXiv.org, Citations: 1