Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and a Variational Information Bottleneck

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of existing voice spoofing detection models in cross-domain scenarios, which the authors attribute to over-reliance on speaker identity cues at the expense of genuine spoofing artifacts. To mitigate this, they propose a speaker-label-free teacher–student framework that uses a pre-trained speaker recognition model as the teacher. A gradient reversal layer steers the student network toward speaker-invariant representations, while a variational information bottleneck balances the suppression of speaker identity information against the preservation of spoofing-related cues. The approach learns speaker-invariant representations for spoofing detection without speaker labels, disentangling speaker and spoofing characteristics. Experiments across nine datasets show a 25.7% relative reduction in equal error rate (EER) compared to the MHFA baseline.
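The gradient reversal layer mentioned above can be illustrated with a minimal, framework-free sketch. It shows only the generic mechanism: identity in the forward pass, negated and scaled gradients in the backward pass. The class name `GradientReversal`, the strength parameter `lam`, and the toy values are assumptions for illustration; the summary does not specify how the authors connect the layer to the speaker recognition teacher.

```python
class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical reversal strength, not a value from the paper.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged on the way to the speaker branch.
        return list(features)

    def backward(self, grad_output):
        # Gradients from the speaker branch are flipped, so an update that
        # would make speaker identity easier to recover instead pushes the
        # upstream encoder away from encoding it.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
student_features = [0.2, -1.0, 3.0]           # toy student embedding
passed = grl.forward(student_features)
grad_from_speaker_loss = [1.0, -2.0, 4.0]     # toy gradient of a speaker loss
reversed_grad = grl.backward(grad_from_speaker_loss)
print(passed)          # [0.2, -1.0, 3.0]
print(reversed_grad)   # [-0.5, 1.0, -2.0]
```

In an automatic differentiation framework the same behaviour is usually implemented as a custom function with an overridden backward pass; the manual version here only makes the sign flip explicit.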

📝 Abstract

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
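As a rough illustration of how a Variational Information Bottleneck trades compression against task performance, the sketch below computes the standard KL penalty between a diagonal Gaussian latent and a standard normal prior and adds it to a task loss with weight `beta`. The function names, the prior, the value of `beta`, and the toy numbers are assumptions; the abstract does not give the authors' exact objective.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vib_loss(task_loss, mu, log_var, beta):
    """Task loss plus a beta-weighted compression penalty (beta is hypothetical)."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu = [1.0, 0.0]        # toy latent means
log_var = [0.0, 0.0]   # toy latent log-variances (unit variance)
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
total = vib_loss(task_loss=0.3, mu=mu, log_var=log_var, beta=0.1)
print(kl)  # 0.5
```

A larger `beta` compresses the latent more aggressively, which in this setting would discard more information overall, including speaker cues; a smaller `beta` keeps more of the input, which favours preserving spoofing-related cues.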

Problem

Research questions and friction points this paper is trying to address.

spoofing detection

speaker bias

out-of-domain generalization

speaker-invariant representation

voice biometrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

speaker-invariant

gradient reversal

variational information bottleneck