🤖 AI Summary
Facing the growing challenge of detecting deepfakes generated by rapidly advancing TTS-based voice cloning systems, this paper proposes ADD-GP, a few-shot adaptive framework designed to adapt efficiently to unseen TTS generators using only a small number of samples (even a single one). Methodologically, ADD-GP is the first to introduce Gaussian process classification into audio deepfake detection, combining deep acoustic embeddings with the flexibility of Gaussian processes to enable personalized detection and cross-model generalization without large-scale labeled data. Its key contributions are: (1) eliminating the reliance on the extensive annotated datasets typical of conventional supervised fine-tuning; (2) achieving >92% one-shot adaptation accuracy on a newly constructed benchmark built from state-of-the-art voice cloning models; and (3) significantly improving robustness and transferability to previously unseen TTS models.
📄 Abstract
Recent advancements in Text-to-Speech (TTS) models, particularly in voice cloning, have intensified the demand for adaptable and efficient deepfake detection methods. As TTS systems continue to evolve, detection models must be able to adapt efficiently to previously unseen generation models with minimal data. This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). We show how combining a powerful deep embedding model with the flexibility of Gaussian processes can achieve strong performance and adaptability. Additionally, we show that this approach can also be used for personalized detection, with greater robustness to new TTS models and one-shot adaptability. To support our evaluation, a benchmark dataset is constructed for this task using new state-of-the-art voice cloning models.
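To make the core idea concrete, here is a minimal sketch (not the authors' code) of GP classification over fixed embeddings, where "one-shot adaptation" to an unseen generator amounts to conditioning the non-parametric GP on a single extra labeled sample and refitting. The random 2-D embeddings and scikit-learn's `GaussianProcessClassifier` are placeholders standing in for the paper's deep acoustic embedding model and meta-learned GP:

```python
# Hedged sketch: GP-based audio deepfake detection with one-shot adaptation.
# Random vectors stand in for deep acoustic embeddings of real/fake audio.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
D = 2  # embedding dimension (placeholder)

# Placeholder embeddings: bona fide speech vs. a known TTS generator.
real = rng.normal(loc=0.0, scale=1.0, size=(40, D))
fake_known = rng.normal(loc=2.0, scale=1.0, size=(40, D))
X = np.vstack([real, fake_known])
y = np.array([0] * 40 + [1] * 40)  # 0 = real, 1 = fake

# Base detector: GP classifier on top of the (frozen) embeddings.
gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X, y)

# An unseen generator whose embeddings fall in a different region.
fake_new = rng.normal(loc=-2.0, scale=1.0, size=(20, D))

# One-shot adaptation: add a single labeled sample from the new
# generator and refit; a GP adapts by conditioning on the new point,
# with no gradient-based fine-tuning of the embedding model.
X_adapt = np.vstack([X, fake_new[:1]])
y_adapt = np.append(y, 1)
gp_adapt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp_adapt.fit(X_adapt, y_adapt)

acc_before = gp.score(fake_new[1:], np.ones(19))
acc_after = gp_adapt.score(fake_new[1:], np.ones(19))
print(f"accuracy on unseen generator: before={acc_before:.2f}, after={acc_after:.2f}")
```

The appeal of this setup, as the abstract suggests, is that adaptation needs no large labeled dataset and no retraining of the deep model: only the lightweight GP head is updated with the few available samples.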