ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

📅 2025-01-08
🤖 AI Summary
This work addresses zero-shot voice style transfer: converting source speech to an arbitrary target speaking style without paired target-style data, while preserving the source speaker's identity and rendering the style with high fidelity. Methodologically, it proposes the first framework to integrate a speech codec, a latent diffusion model conditioned on speech prompts, and Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN). It further introduces an information bottleneck constraint and a novel adversarial training strategy to strengthen in-context learning and improve style similarity. Evaluated on 44,000 hours of speech, the method achieves significant improvements in zero-shot style diversity, speaker timbre preservation, and style similarity, with quantitative and qualitative results consistently surpassing existing state-of-the-art approaches.

📝 Abstract
Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotion, limiting their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style and speaker timbre, we introduce an information bottleneck to filter speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve style similarity. Experiments conducted on 44,000 hours of speech data demonstrate the superior performance of ZSVC in generating speech with diverse speaking styles in zero-shot scenarios.
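The abstract describes UMAdaIN as perturbing speaker timbre in the style prompt. The page does not give the formula, so the following is a minimal NumPy sketch assuming UMAdaIN follows the standard AdaIN form with Gaussian-perturbed style statistics; the function name, the `noise_scale` parameter, and the `(channels, time)` feature layout are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def umadain(content, style, eps=1e-5, noise_scale=1.0, rng=None):
    """Sketch of Uncertainty Modeling AdaIN (assumed form).

    Normalizes the content features per channel, then re-scales and
    shifts them with the style prompt's per-channel statistics after
    adding Gaussian noise to those statistics. The noise blurs the
    speaker timbre carried by the style prompt's exact statistics.

    content, style: (channels, time) feature arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Per-channel instance statistics.
    c_mu = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True)
    s_mu = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    # Perturb the style statistics to model their uncertainty.
    s_mu = s_mu + noise_scale * s_mu.std() * rng.standard_normal(s_mu.shape)
    s_std = s_std + noise_scale * s_std.std() * rng.standard_normal(s_std.shape)
    # Standard AdaIN with the perturbed target statistics.
    normalized = (content - c_mu) / (c_std + eps)
    return normalized * s_std + s_mu
```

With `noise_scale=0` this reduces to plain AdaIN, so the perturbation strength directly controls how much of the prompt's timbre information survives.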
Problem

Research questions and friction points this paper is trying to address.

Speech Style Transfer
Voice Characteristics Preservation
Naturalness and Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Voice Style Conversion
Speaker Identity Preservation
Natural Style Transfer
Xinfa Zhu
Northwestern Polytechnical University
Lei He
Microsoft, Beijing, China
Yujia Xiao
The Chinese University of Hong Kong
Xi Wang
Microsoft, Beijing, China
Xu Tan
Microsoft, Beijing, China
Sheng Zhao
Microsoft
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China