AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deepfake audio detection suffers from poor generalization under open-world domain shifts. Method: This paper introduces AUDETER, a large-scale, highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic speech generated by 11 state-of-the-art TTS models combined with 10 vocoders, capturing dual diversity across authentic human speech and cutting-edge forgery techniques. It also establishes the first open-world deepfake audio evaluation benchmark to systematically expose performance bottlenecks of existing methods in cross-domain settings. Contribution/Results: Detection models trained on AUDETER achieve a detection error rate of only 4.17% on cross-domain benchmarks such as In-the-Wild, reducing error rates by 44.1% to 51.6% over SOTA approaches and demonstrating substantial improvements in robustness and generalization. AUDETER thus provides critical infrastructure and empirical validation for developing universal deepfake audio detectors.
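As a quick sanity check on the reported numbers (illustrative arithmetic only; the paper's actual baseline figures may differ), the 44.1%–51.6% relative reduction and the 4.17% final error rate together imply SOTA baseline error rates of roughly 7.5%–8.6%:

```python
# Infer the implied SOTA baseline error rates from the reported results.
# Assumption: "reducing error rates by X%" means a relative reduction,
# i.e. audeter_error = baseline * (1 - X).
audeter_error = 4.17  # % error rate on In-the-Wild when trained on AUDETER

for relative_reduction in (0.441, 0.516):
    baseline = audeter_error / (1 - relative_reduction)
    print(f"{relative_reduction:.1%} reduction implies a baseline of ~{baseline:.2f}%")
```

This is only back-of-the-envelope bookkeeping on the summary's own figures, not a number taken from the paper.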

📝 Abstract
Speech generation systems can produce remarkably realistic vocalisations that are often indistinguishable from human speech, posing significant authenticity challenges. Although numerous deepfake detection methods have been developed, their effectiveness in real-world environments remains unreliable due to the domain shift between training and test samples arising from diverse human speech and fast-evolving speech synthesis systems. This is not adequately addressed by current datasets, which lack the real-world challenges posed by diverse and up-to-date audio in both the real and deepfake categories. To fill this gap, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale, highly diverse deepfake audio dataset for comprehensive evaluation and robust development of generalised models for deepfake audio detection. It consists of over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders with a broad range of TTS/vocoder patterns, totalling 3 million audio clips, making it the largest deepfake audio dataset by scale. Through extensive experiments with AUDETER, we reveal that i) state-of-the-art (SOTA) methods trained on existing datasets struggle to generalise to novel deepfake audio samples and suffer from high false positive rates on unseen human voices, underscoring the need for a comprehensive dataset; and ii) these methods trained on AUDETER achieve highly generalised detection performance and significantly reduce the detection error rate by 44.1% to 51.6%, achieving an error rate of only 4.17% on diverse cross-domain samples in the popular In-the-Wild dataset, paving the way for training generalist deepfake audio detectors. AUDETER is available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Detecting deepfake audio in real-world environments
Addressing domain shift between training and test samples
Improving generalization to novel deepfake audio samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset covering diverse TTS and vocoder patterns (11 TTS models, 10 vocoders)
Over 3 million audio clips totalling 4,500+ hours
Reduces detection error rate by 44.1% to 51.6% over SOTA methods
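As a small sanity check on the scale figures above (an illustrative sketch using only the two numbers stated in this summary), 4,500 hours spread over 3 million clips implies an average clip length of about 5.4 seconds, which is consistent with short utterance-level samples:

```python
# Derive the implied average clip duration from the quoted dataset scale.
# Both inputs are the figures stated in the summary; the true per-clip
# distribution is not reported here.
total_hours = 4500       # stated lower bound of synthetic audio
num_clips = 3_000_000    # stated clip count

avg_clip_seconds = total_hours * 3600 / num_clips
print(f"Average clip length: {avg_clip_seconds:.1f} s")  # → 5.4 s
```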