Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of current detection models against neural audio codec-based speech generation (CodecFake), which stems from the domain shift between proxy data such as codec-resynthesized speech (CoRS) and real-world scenarios. To mitigate this, the authors propose Domain-Shift Feature Augmentation (DSFA), which models deterministic feature statistics as stochastic distributions during fine-tuning to better capture the diversity of in-the-wild forged speech. DSFA is combined with a post-trained self-supervised learning (SSL) backbone to improve generalization. The authors also introduce CoSG ExtEval, a more challenging evaluation benchmark comprising 40 unseen generative models and long-form audio samples. Experimental results show that the proposed method achieves state-of-the-art performance on both CoSG Eval and CoSG ExtEval, improving robustness against diverse CodecFake attacks.

📝 Abstract

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec-resynthesized speech (CoRS) as proxy data improves performance, this approach often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
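The abstract does not spell out how DSFA turns feature statistics into distributions, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes features of shape (batch, time, channels), treats each utterance's per-channel mean and standard deviation as the "deterministic statistics", and resamples them from Gaussians whose spread is estimated across the batch. The function name `perturb_feature_statistics`, the application probability `p`, and the use of NumPy instead of a deep learning framework are all illustrative assumptions.

```python
import numpy as np


def perturb_feature_statistics(x, rng, p=0.5, eps=1e-6):
    """Randomly resample per-utterance feature statistics (hypothetical sketch).

    x   : array of shape (batch, time, channels), e.g. SSL hidden states.
    rng : numpy.random.Generator used for all sampling.
    p   : probability of applying the perturbation to this batch (assumed).
    """
    # Skip the augmentation with probability 1 - p.
    if rng.random() >= p:
        return x

    # Deterministic statistics: per-utterance, per-channel mean and std over time.
    mu = x.mean(axis=1, keepdims=True)                     # (B, 1, C)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)    # (B, 1, C)

    # How much those statistics vary across the batch, per channel.
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)        # (1, 1, C)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)  # (1, 1, C)

    # Stochastic statistics: sample a new mean and std around the observed ones.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread

    # Normalize with the observed statistics, then re-scale with the sampled ones.
    return new_sigma * (x - mu) / sigma + new_mu


# Usage: perturb a random batch of 4 utterances, 50 frames, 8 channels.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 50, 8))
augmented = perturb_feature_statistics(features, rng, p=1.0)
```

In a fine-tuning loop, a transform like this would be applied only in training mode and disabled at evaluation time. Where in the network it sits, and how often it fires, are choices the summary does not specify.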

Problem

Research questions and friction points this paper is trying to address.

deepfake speech

domain gap

generalization

proxy data

speech generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Shift Feature Augmentation

CodecFake

proxy-to-wild domain gap