Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of current detection models against speech generated with neural audio codecs (CodecFake). The cause is a domain shift between proxy training data, namely codec-resynthesized speech (CoRS), and real-world scenarios. To mitigate this, the authors propose Domain-Shift Feature Augmentation (DSFA), which treats deterministic feature statistics as stochastic distributions during fine-tuning so that the model sees the kind of variation found in in-the-wild CodecFake speech. DSFA is combined with a post-trained self-supervised learning (SSL) backbone to improve generalization. The authors also introduce CoSG ExtEval, a more challenging evaluation benchmark comprising 40 unseen generative models and long-form audio samples. Experimental results show that the proposed method achieves state-of-the-art performance on both CoSG Eval and CoSG ExtEval across diverse CodecFake attacks.
📝 Abstract
Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec-resynthesized speech (CoRS) as proxy data improves performance, models trained this way often generalize poorly. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval dataset (from CodecFake+), featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
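The abstract does not spell out how DSFA turns feature statistics into distributions, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the statistics are the per-channel mean and standard deviation over time, that their spread across the mini-batch serves as the variance of a Gaussian, and that features are re-normalized with statistics sampled from it. The function names (`channel_stats`, `spread`, `perturb_feature_statistics`) and the nested-list layout `batch[utterance][channel][frame]` are hypothetical choices made for illustration.

```python
import math
import random


def channel_stats(x, eps=1e-6):
    """Mean and standard deviation over time for one channel (a list of floats)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, math.sqrt(var + eps)


def spread(values, eps=1e-6):
    """Standard deviation of a statistic across the batch (assumed uncertainty)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var + eps)


def perturb_feature_statistics(batch, rng, training=True):
    """Resample per-channel mean/std of batch[b][c] (a list of T floats).

    Assumption: this mirrors a generic "statistics as Gaussians" augmentation;
    the paper's exact formulation may differ.
    """
    if not training:
        # At inference time the features pass through unchanged.
        return [[list(ch) for ch in utt] for utt in batch]

    n_channels = len(batch[0])
    stats = [[channel_stats(ch) for ch in utt] for utt in batch]
    out = [[None] * n_channels for _ in batch]

    for c in range(n_channels):
        # How much the mean and std of this channel vary across the batch.
        mean_spread = spread([stats[b][c][0] for b in range(len(batch))])
        std_spread = spread([stats[b][c][1] for b in range(len(batch))])
        for b, utt in enumerate(batch):
            mean, std = stats[b][c]
            # Sample new statistics around the observed ones.
            new_mean = mean + rng.gauss(0.0, 1.0) * mean_spread
            new_std = std + rng.gauss(0.0, 1.0) * std_spread
            # Normalize with the original statistics, restore with the sampled ones.
            out[b][c] = [new_std * (v - mean) / std + new_mean for v in utt[c]]
    return out
```

In a fine-tuning loop, a layer like this would sit between backbone layers and be active only in training mode, for example `augmented = perturb_feature_statistics(features, random.Random(0))`. Under these assumptions the temporal pattern of each channel is preserved while its offset and scale vary from step to step, which is one way to expose the classifier to statistics it would not see in CoRS proxy data alone.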
Problem

Research questions and friction points this paper is trying to address.

deepfake speech
domain gap
generalization
proxy data
speech generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Shift Feature Augmentation
CodecFake
proxy-to-wild domain gap
CoSG ExtEval
self-supervised learning