Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

📅 2026-06-09
🤖 AI Summary
This overview paper addresses the detection of deepfake audio in which speech and environmental sounds are synthesized independently or jointly, a scenario in which existing methods generalize poorly to unseen generators and acoustic environments. Rather than proposing a single model, it reports the results of the ESDD2 challenge and summarizes what worked: top-performing submissions combined modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling instead of relying on simple model scaling. Systems were evaluated across five component-level spoofing scenarios using Macro-F1, with Equal Error Rate (EER) as an auxiliary analysis. The best submitted system achieved a Macro-F1 score of 0.8775 on the test set, well above the separation-enhanced joint learning baseline (0.6327). The EER analyses nonetheless show that detecting spoofed environmental sounds remains an open challenge.
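To make the headline metric concrete, here is a minimal sketch of Macro-F1: the unweighted mean of per-class F1 scores. It is not the challenge's scoring script. The function name `macro_f1`, the integer label encoding, and the default of five classes (taken from the five scenarios mentioned above) are assumptions for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores.

    Assumption: classes are encoded as integers 0..num_classes-1.
    A class with no true and no predicted samples contributes 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    f1_scores = []
    for c in range(num_classes):
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


if __name__ == "__main__":
    # Toy labels only; not challenge data.
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Because every class carries equal weight, a system that does well on frequent classes but poorly on a rare one is penalized more than it would be under plain accuracy.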
📝 Abstract
The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for component-level audio spoofing detection across five scenarios, where speech and environmental sounds may be manipulated independently or jointly. Following the close of the challenge, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
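The auxiliary EER analyses mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the challenge's protocol (which component is scored, or the score polarity), so this example assumes a binary setup where higher scores mean "more likely spoofed" and labels are 1 for spoofed and 0 for bona fide. The function name `equal_error_rate` is hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumptions: higher score = more likely spoofed; label 1 = spoofed,
    label 0 = bona fide. A sample is flagged as spoofed when score >= threshold.
    """
    spoof = [s for s, y in zip(scores, labels) if y == 1]
    bona = [s for s, y in zip(scores, labels) if y == 0]
    if not spoof or not bona:
        raise ValueError("EER needs at least one sample of each class")

    # Include +inf so the sweep covers the "flag nothing" operating point.
    thresholds = sorted(set(scores)) + [float("inf")]

    best_gap, eer = None, None
    for t in thresholds:
        fpr = sum(s >= t for s in bona) / len(bona)    # bona fide flagged as spoofed
        fnr = sum(s < t for s in spoof) / len(spoof)   # spoofed samples missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint where the two error rates are closest.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores only; not challenge data.
    print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

Computing this separately for the speech and environmental components would be one way to see which component is harder to detect.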
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
audio spoofing
environmental sounds
speech manipulation
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular task decomposition
cross-domain self-supervised encoders
targeted data augmentation
selective ensembling
environment-aware deepfake detection