Controlling your Attributes in Voice

📅 2025-01-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses controllable editing of speaker attributes in speech (e.g., age, gender) under zero-shot conditions and without parallel data. We propose an unsupervised disentanglement framework that requires no paired samples. Methodologically, we combine a GAN-enhanced variational autoencoder with a two-stage voice conversion architecture, which we believe is the first such integration. The combination disentangles speaker identity from speaker attributes without supervision and allows each to be manipulated independently, both in the speaker embedding space and in the waveform domain. Our key contribution is the first demonstration of fine-grained, attribute-controllable editing on unpaired speech that preserves speaker identifiability and speech naturalness. Experiments show that the synthesized speech achieves MOS ≥ 4.1 and preserves speaker identity (cosine similarity > 0.89), significantly outperforming existing unsupervised baselines.

📝 Abstract
Attribute control in generative tasks aims to modify personal attributes, such as age and gender, while preserving the identity information in the source sample. Although significant progress has been made in controlling facial attributes in image generation, similar approaches for speech generation remain largely unexplored. This letter proposes a novel method for controlling speaker attributes in speech without parallel data. Our approach consists of two main components: a GAN-based speaker representation variational autoencoder that extracts speaker identity and attributes from a speaker vector, and a two-stage voice conversion model that captures the natural expression of speaker attributes in speech. Experimental results show that the proposed method not only achieves attribute control at the speaker representation level but also enables manipulation of speaker age and gender at the speech level while preserving speech quality and speaker identity.
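
The idea of editing attributes at the speaker representation level while leaving identity untouched can be illustrated with a toy sketch. It is not the paper's model: the learned GAN-based VAE is replaced here by a fixed, hypothetical latent layout in which the first `id_dim` entries stand for identity and the rest for attributes such as age and gender. The names `split_latent`, `edit_attributes`, `id_dim`, and `delta` are assumptions introduced only for this example.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_latent(z, id_dim):
    """Split a latent code into (identity, attributes).

    Assumption: a hypothetical layout where the first id_dim entries
    encode identity and the remaining entries encode attributes.
    """
    return z[:id_dim], z[id_dim:]


def edit_attributes(z, id_dim, delta):
    """Shift only the attribute part of z by delta; identity is copied as-is."""
    identity, attributes = split_latent(z, id_dim)
    if len(delta) != len(attributes):
        raise ValueError("delta must match the number of attribute dimensions")
    edited = [a + d for a, d in zip(attributes, delta)]
    return identity + edited


# Toy latent code: 6 identity dimensions, 2 attribute dimensions (assumed sizes).
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Shift the first attribute dimension and leave the second unchanged.
z_edited = edit_attributes(z, id_dim=6, delta=[1.5, 0.0])

id_before, _ = split_latent(z, 6)
id_after, _ = split_latent(z_edited, 6)
print("identity cosine similarity:", cosine_similarity(id_before, id_after))
```

In this sketch the identity part is preserved by construction, so the similarity check passes trivially. In a learned model the separation must be enforced by training, and identity preservation is something to be measured rather than assumed.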
Problem

Research questions and friction points this paper is trying to address.

Speaker Attribute Modification
Unsupervised Learning
Speech Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker Attribute Modification
No Reference Data Required
Naturalness Preservation
Xuyuan Li
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China
Zengqiang Shang
Institute of Acoustics, Chinese Academy of Sciences
speech
Li Wang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
Pengyuan Zhang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China