🤖 AI Summary
The proliferation of synthetic singing voice deepfakes in the music industry poses significant challenges for authenticating vocal content. Method: This paper proposes a deepfake detection method leveraging noise-variant features extracted from Whisper encoders. Departing from conventional approaches that exploit Whisper’s robustness, we first identify and harness its sensitivity to noise—specifically, forged singing voices induce distinctive, scale-dependent (tiny/base/small/medium) encoding variations across Whisper models. These variations are formalized as discriminative features. We further integrate CNN and ResNet34 architectures to jointly model both dry (unmixed) and mixed audio scenarios. Results: Extensive experiments demonstrate that our method achieves significantly lower equal error rates (EER) compared to state-of-the-art baselines, validating the effectiveness and generalizability of noise-variant encoding features for singing voice deepfake detection.
📝 Abstract
The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system that uses noise-variant encodings from OpenAI's Whisper model. Counter-intuitively, although the Whisper model is known to be noise-robust, its encodings are rich in non-speech information and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. Accordingly, the SVDD task is performed on both vocals and mixtures, and performance is evaluated in %EER over varying Whisper model sizes and two classifiers, CNN and ResNet34, under different testing conditions.
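The abstract reports performance in %EER (equal error rate): the operating point where the false-acceptance rate on spoofed audio equals the false-rejection rate on genuine audio. As a minimal illustrative sketch (the function name and the convention that higher scores mean "more likely genuine" are our own, not taken from the paper), EER can be computed from detector scores like this:

```python
def compute_eer(genuine_scores, spoof_scores):
    """Find the threshold where the false-acceptance rate (spoofed audio
    scored at or above the threshold) is closest to the false-rejection
    rate (genuine audio scored below it), and return their average.
    Assumes higher scores indicate genuine audio."""
    thresholds = sorted(genuine_scores + spoof_scores)
    best = None  # (|FAR - FRR|, candidate EER)
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Perfectly separated scores give 0% EER:
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]) * 100)  # 0.0
```

In practice, SVDD evaluations compute EER over the detector's score distributions on held-out genuine and deepfake vocals; a lower %EER indicates better separation between the two classes.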