VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching

📅 2025-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address two persistent challenges in zero-shot voice conversion (VC), low speaker similarity and the mismatch between training and inference, this paper proposes VoicePrompter, a robust method that combines voice prompts with conditional flow matching (CFM). The approach rests on three components: (1) voice prompts that enable in-context learning for zero-shot VC; (2) a factorization method that disentangles speech components and feeds a DiT-based CFM decoder conditioned on those features and on the voice prompt; and (3) latent mixup, which combines speaker features to strengthen in-context learning and improve fidelity for speakers unseen during training. Experiments show that the method outperforms existing zero-shot VC systems in speaker similarity, speech intelligibility, and audio quality.

📝 Abstract

Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. This approach improves speaker similarity and naturalness in zero-shot VC by applying mixup to latent representations. Experimental results demonstrate that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality. Our demo is available at https://hayeong0.github.io/VoicePrompter-demo/.
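
The abstract names two generic techniques, latent mixup and conditional flow matching, without giving their formulas. The NumPy sketch below illustrates the textbook form of each, not the authors' implementation. Everything in it is an assumption made for illustration: the helper names (latent_mixup, cfm_training_pair, cfm_loss), the Beta-distributed mixing weight, the optimal-transport probability path with sigma_min, and the array shapes. The abstract does not say which latents are mixed or which path the decoder is trained on.

```python
import numpy as np


def latent_mixup(z_a, z_b, lam):
    """Convex combination of two latent feature vectors (hypothetical helper)."""
    return lam * z_a + (1.0 - lam) * z_b


def cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    """Build one training pair for the optimal-transport CFM path (assumed here).

    x0 is a noise sample, x1 is a data sample, and t lies in [0, 1].
    Returns the interpolated point x_t and the target velocity u_t.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))


rng = np.random.default_rng(0)

# Mix two stand-in speaker latents with a Beta-sampled weight (an assumption).
z_a = rng.normal(size=4)
z_b = rng.normal(size=4)
lam = rng.beta(1.0, 1.0)
z_mix = latent_mixup(z_a, z_b, lam)

# Stand-in "mel-spectrogram" target and a matching noise sample.
x1 = rng.normal(size=(80, 10))
x0 = rng.normal(size=x1.shape)
x_t, u_t = cfm_training_pair(x0, x1, t=0.3)

# A real decoder would predict the velocity from x_t, t, and its conditions;
# passing the target itself only demonstrates the loss computation.
loss = cfm_loss(u_t, u_t)
```

In a full system, the velocity would come from the DiT-based decoder conditioned on the factorized features and the voice prompt; that network is outside the scope of this sketch.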

Problem

Research questions and friction points this paper is trying to address.

Speech Conversion

Unseen Speakers

Voice Fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

VoicePrompter

Sound Decomposition

Speaker-Adaptive Learning
