🤖 AI Summary
This paper addresses controllable voice speaker attribute editing (e.g., age, gender) under zero-shot, non-parallel data conditions. We propose an unsupervised disentanglement framework that requires no paired samples. Methodologically, we introduce the first integration of a GAN-enhanced variational autoencoder with a two-stage acoustic conversion architecture—enabling unsupervised disentanglement and independent manipulation of speaker identity and attributes separately in the speaker embedding space and raw waveform domain. Our key contribution is the first demonstration of fine-grained, attribute-controllable editing on unpaired speech while preserving speaker identifiability and speech naturalness. Experiments show synthesized speech achieves MOS ≥ 4.1 and speaker identity preservation (cosine similarity > 0.89), significantly outperforming existing unsupervised baselines.
📝 Abstract
Attribute control in generative tasks aims to modify personal attributes, such as age and gender while preserving the identity information in the source sample. Although significant progress has been made in controlling facial attributes in image generation, similar approaches for speech generation remain largely unexplored. This letter proposes a novel method for controlling speaker attributes in speech without parallel data. Our approach consists of two main components: a GAN-based speaker representation variational autoencoder that extracts speaker identity and attributes from speaker vector, and a two-stage voice conversion model that captures the natural expression of speaker attributes in speech. Experimental results show that our proposed method not only achieves attribute control at the speaker representation level but also enables manipulation of the speaker age and gender at the speech level while preserving speech quality and speaker identity.