Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

📅 2024-08-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This study addresses the severe emotional distortion introduced by disentanglement-based speaker anonymization. We propose a method that jointly ensures strong privacy protection and high emotional fidelity. Methodologically, we (1) employ a pre-trained emotion encoder to explicitly disentangle emotion representations from speaker identity; (2) introduce SVM-based emotion boundary modeling with a directional embedding-space compensation mechanism, enabling fine-grained adjustment of anonymized speaker embeddings along emotion gradients; and (3) integrate dual-path compensation: encoder-level feature fusion and post-hoc refinement. Experiments demonstrate that our approach achieves state-of-the-art speaker unidentifiability (ASR-based ID error >95%) while improving emotion recognition accuracy by 12.7% over existing disentanglement-based anonymization methods. Moreover, the framework extends to controllable preservation of other paralinguistic attributes, such as age and accent.

📝 Abstract
A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn a separate boundary for each emotion. During inference, the original speaker embedding is processed in two ways: first, by an emotion indicator that predicts the emotion and selects the matching SVM; and second, by a speaker anonymizer that conceals the speaker characteristics. The anonymized speaker embedding is then shifted along the corresponding SVM boundary's normal direction to restore the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.
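The compensation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes linear SVM boundaries (one per emotion, fitted on speaker embeddings labelled with that emotion vs. the rest), toy 16-dimensional embeddings, and a single `step` hyperparameter controlling how far the anonymized embedding is pushed along the boundary's normal vector.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy speaker embeddings: target emotion (label 1) vs. all others (label 0).
# In the paper's setting, one such boundary is learned per emotion class.
X = np.vstack([rng.normal(0.5, 1.0, (50, 16)), rng.normal(-0.5, 1.0, (50, 16))])
y = np.array([1] * 50 + [0] * 50)

svm = LinearSVC(C=1.0).fit(X, y)  # linear boundary for this emotion

def compensate(anon_emb, svm, step=0.5):
    """Shift an anonymized embedding along the SVM boundary's unit normal,
    i.e. towards the side associated with the predicted emotion."""
    w = svm.coef_[0]
    direction = w / np.linalg.norm(w)
    return anon_emb + step * direction

anon = rng.normal(0.0, 1.0, 16)  # stand-in for an anonymized embedding
enhanced = compensate(anon, svm)

# The compensated embedding lies further on the emotion's positive side.
assert svm.decision_function([enhanced])[0] > svm.decision_function([anon])[0]
```

Because the boundary normal `w` is the direction along which the SVM's decision score grows fastest, a small shift along it strengthens the emotion attribute while leaving the rest of the embedding largely unchanged; `step` would be tuned to trade off emotional fidelity against distortion.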
Problem

Research questions and friction points this paper is trying to address.

Adapt speaker anonymization to preserve emotional cues
Balance privacy protection and emotion retention
Extend system to preserve other paralinguistic attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate emotion embeddings from pre-trained encoder
Propose emotion compensation post-processing strategy
Use SVM to model and enhance emotional attributes
Xiaoxiao Miao
Duke Kunshan University
Speech Privacy · Speaker and Language Identification · Speech Synthesis
Yuxiang Zhang
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
Xin Wang
National Institute of Informatics, Chiyoda-ku, Tokyo 101-8340, Japan
N. Tomashenko
Inria, Centre Inria de l’Université de Lorraine, France
D. Soh
Singapore Institute of Technology, Singapore, 567739
Ian McLoughlin
Singapore Institute of Technology, Singapore, 567739