Why disentanglement-based speaker anonymization systems fail at preserving emotions?

📅 2025-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speaker anonymization systems severely degrade emotional information during identity disentanglement, frequently misclassifying anonymized speech as “angry.” This work identifies, for the first time, the absence of emotional encoding in intermediate representations as the root cause. We further find that generative speaker embeddings and vocoder-induced synthesis artifacts—particularly those elevating spectral kurtosis—exacerbate this bias. Through decoupled representation learning, a GAN-based vocoder, Wav2Vec 2.0–based emotion recognition, and spectral statistical analysis, we quantitatively attribute contributions across modules: emotional deficiency in intermediate representations dominates (>60%), followed by speaker embedding effects; out-of-distribution (OOD) impacts from the vocoder are comparatively minor. To address evaluation shortcomings, we propose a novel emotion-preservation assessment protocol based on unweighted average recall, which significantly improves robustness against annotation bias and synthesis artifacts.
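The summary's claim that synthesis artifacts elevate spectral kurtosis can be made concrete: spectral kurtosis here is the kurtosis of the short-time magnitude spectrum, which grows when spectral energy concentrates in a few sharp peaks. The sketch below is illustrative, not the authors' implementation; frame length and hop size are assumed values.

```python
import numpy as np
from scipy.stats import kurtosis

def spectral_kurtosis(signal, n_fft=512, hop=256):
    """Per-frame Fisher kurtosis of the magnitude spectrum.

    Fisher kurtosis is 0 for a Gaussian; large positive values mean
    the spectral energy is concentrated in a few sharp peaks.
    """
    window = np.hanning(n_fft)
    values = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = window * signal[start:start + n_fft]
        mag = np.abs(np.fft.rfft(frame))
        values.append(kurtosis(mag))
    return np.array(values)
```

A peaky signal such as a pure tone yields far higher spectral kurtosis than broadband noise, which is the direction in which narrowband synthesis artifacts would push this statistic.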

📝 Abstract
Disentanglement-based speaker anonymization involves decomposing speech into a semantically meaningful representation, altering the speaker embedding, and resynthesizing a waveform using a neural vocoder. State-of-the-art systems of this kind are known to remove emotion information. Possible reasons include mode collapse in GAN-based vocoders, unintended modeling and modification of emotions through speaker embeddings, or excessive sanitization of the intermediate representation. In this paper, we conduct a comprehensive evaluation of a state-of-the-art speaker anonymization system to understand the underlying causes. We conclude that the main reason is the lack of emotion-related information in the intermediate representation. The speaker embeddings also have a high impact if they are learned in a generative context. The vocoder's out-of-distribution performance has a smaller impact. Additionally, we discovered that synthesis artifacts increase spectral kurtosis, biasing emotion recognition evaluation towards classifying utterances as angry. Therefore, we conclude that reporting unweighted average recall alone for emotion recognition performance is suboptimal.
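Unweighted average recall (UAR) is the macro-average of per-class recalls, so every emotion class counts equally regardless of how many test utterances it has. A minimal sketch with hypothetical toy labels shows why an "angry" bias can keep UAR looking moderate even when most utterances are misclassified:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred, labels):
    """Mean of per-class recalls: each class weighs equally,
    independent of its number of samples."""
    recalls = []
    for c in labels:
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

# Hypothetical toy evaluation: a recognizer that, nudged by
# synthesis artifacts, labels every anonymized utterance "angry".
y_true = np.array(["neutral"] * 8 + ["angry"] * 2)
y_pred = np.array(["angry"] * 10)

uar = unweighted_average_recall(y_true, y_pred, ["neutral", "angry"])
accuracy = float((y_true == y_pred).mean())
```

Here UAR is 0.5 (perfect recall on the rare "angry" class masks zero recall on "neutral") while plain accuracy is 0.2, which illustrates why reporting UAR alone can be misleading under such a bias.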
Problem

Research questions and friction points this paper is trying to address.

Speaker Anonymization
Emotional Retention
Feature Disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker Anonymization
Emotional Information Preservation
Performance Evaluation in Emotion Recognition
Unal Ege Gaznepoglu
International Audio Laboratories Erlangen, Friedrich-Alexander-University Erlangen-Nürnberg, Germany
Nils Peters
Assistant Professor for Acoustics, Audio DSP & ML at Trinity College Dublin
Spatial Audio, Digital Signal Processing, Auditory Perception, Machine Perception, Internet of