🤖 AI Summary
This study addresses the privacy-utility trade-off in voice anonymization without relying on speaker embeddings. It proposes SEF-MK, a speaker-embedding-free framework that trains multiple k-means models, each on a different subset of speakers, and quantizes the self-supervised speech representations (e.g., wav2vec 2.0) of each utterance with one model selected at random. The core change is replacing a single k-means model trained on the entire dataset with this multi-model setup, which, from the user's perspective, better preserves linguistic and emotional content. The experiments also reveal a downside: from the attacker's perspective, using multiple k-means models makes privacy attacks more effective. Evaluation shows SEF-MK outperforms the single-k-means baseline by +4.2% in ASR accuracy and +3.8% in emotion recognition accuracy, and the attacker-side finding offers guidance for designing anonymization systems that mitigate such threats.
📝 Abstract
Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user's viewpoint. However, from the attacker's perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
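The per-utterance model selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy Gaussian "SSL frames", the disjoint speaker split, the plain-Python k-means, and the choice to replace each frame with its nearest centroid are all assumptions made for illustration, and the names `fit_kmeans`, `train_models`, and `anonymize` are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, frame):
    # Index of the centroid closest to the frame.
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], frame))

def fit_kmeans(frames, k, rng, iters=10):
    # Plain Lloyd's algorithm; initial centroids are k randomly sampled frames.
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(centroids, f)].append(f)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_models(frames_by_speaker, n_models, k, rng):
    # Assumption: speakers are split into disjoint subsets, one per model.
    speakers = sorted(frames_by_speaker)
    rng.shuffle(speakers)
    models = []
    for m in range(n_models):
        subset = speakers[m::n_models]
        frames = [f for s in subset for f in frames_by_speaker[s]]
        models.append(fit_kmeans(frames, k, rng))
    return models

def anonymize(utterance, models, rng):
    # Pick one k-means model at random for the whole utterance, then
    # (assumption) replace every frame with its nearest centroid.
    centroids = rng.choice(models)
    return [centroids[nearest(centroids, f)] for f in utterance]

# Toy data standing in for SSL features: 6 speakers, 30 frames each, 4 dims.
rng = random.Random(0)
frames_by_speaker = {
    f"spk{s}": [[rng.gauss(s, 1.0) for _ in range(4)] for _ in range(30)]
    for s in range(6)
}
models = train_models(frames_by_speaker, n_models=3, k=5, rng=rng)
utt = frames_by_speaker["spk0"][:8]
anon = anonymize(utt, models, rng)
```

In this sketch the single-model baseline corresponds to `n_models=1`, so the same code path covers both configurations compared in the abstract.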