SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization

📅 2025-08-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the privacy–utility trade-off in voice anonymization without relying on speaker embeddings. It proposes SEF-MK, a speaker-embedding-free framework that trains multiple k-means models, each on a different subset of speakers, and quantizes the self-supervised speech representations (e.g., wav2vec 2.0) of each utterance with one randomly selected model. Compared with a single k-means model trained on the entire dataset, this multi-model quantization better preserves linguistic and emotional content from the user's perspective. The experiments also reveal a drawback: from the attacker's perspective, using multiple k-means models makes privacy attacks more effective. Evaluation shows SEF-MK outperforms the single-k-means baseline by +4.2% in ASR accuracy and +3.8% in emotion recognition accuracy, and the attacker-side findings offer guidance for designing anonymization systems that better resist such attacks.

📝 Abstract

Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user's viewpoint. However, from the attacker's perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
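
The selection scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_kmeans`, `train_codebooks`, `anonymize`), the plain Lloyd's k-means, the even split of speakers across models, and the choice to replace each frame with its nearest centroid are all assumptions made for illustration, and random arrays stand in for real SSL features.

```python
import numpy as np


def fit_kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].copy()
    for _ in range(n_iter):
        # Assign every frame to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def train_codebooks(features_by_speaker, n_models, k, seed=0):
    """Train n_models codebooks, each on a different subset of speakers.

    Assumption: speakers are shuffled and split evenly across the models.
    """
    rng = np.random.default_rng(seed)
    speakers = sorted(features_by_speaker)
    order = rng.permutation(len(speakers))
    shuffled = [speakers[i] for i in order]
    codebooks = []
    for m in range(n_models):
        subset = shuffled[m::n_models]
        frames = np.concatenate([features_by_speaker[s] for s in subset])
        codebooks.append(fit_kmeans(frames, k, seed=seed + m))
    return codebooks


def anonymize(utterance, codebooks, rng):
    """Quantize one utterance with a single randomly selected codebook."""
    centroids = codebooks[rng.integers(len(codebooks))]
    dists = np.linalg.norm(utterance[:, None, :] - centroids[None, :, :], axis=-1)
    # Assumption: each frame is replaced by its nearest centroid.
    return centroids[dists.argmin(axis=1)]


# Toy usage: random vectors stand in for frame-level SSL features.
data_rng = np.random.default_rng(0)
toy_features = {f"spk{i}": data_rng.normal(size=(50, 8)) for i in range(6)}
toy_codebooks = train_codebooks(toy_features, n_models=3, k=4)
toy_output = anonymize(data_rng.normal(size=(30, 8)), toy_codebooks, data_rng)
```

Setting `n_models=1` in this sketch trains one codebook on all speakers, which corresponds to the single k-means setup the abstract compares against.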

Problem

Research questions and friction points this paper is trying to address.

Voice anonymization without speaker embeddings

Balancing privacy and content preservation

Mitigating privacy attack effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-k-means quantization for anonymization

Random selection of k-means models per utterance

Speaker-embedding-free SSL representation processing
