🤖 AI Summary
To address two critical challenges in zero-shot voice conversion (VC), namely low speaker similarity and training-inference mismatch, this paper proposes a robust method that leverages speech prompts and conditional flow matching (CFM). The approach introduces three key innovations: (1) the first integration of speech prompts into zero-shot VC to enable in-context learning; (2) a three-part mechanism comprising disentangled speech feature representation, a DiT-based CFM decoder, and latent-space mixup; and (3) significant improvements in speaker fidelity and cross-speaker generalization without requiring any target speaker utterances during training. Experiments show that the method outperforms state-of-the-art zero-shot VC approaches in speaker similarity, intelligibility, and audio quality, while maintaining inference efficiency and robustness.
📝 Abstract
Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by the mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. Applying mixup to latent representations in this way improves speaker similarity and naturalness in zero-shot VC. Experimental results demonstrate that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality. Our demo is available at https://hayeong0.github.io/VoicePrompter-demo/.
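The two generic ingredients named in the abstract, latent mixup and a conditional flow matching objective, can be illustrated with a minimal NumPy sketch. This is not VoicePrompter's implementation: the abstract does not specify the probability path or the mixup rule, so the sketch assumes the commonly used optimal-transport CFM path and a simple convex combination of two speaker feature vectors. The function names (`latent_mixup`, `cfm_pair`, `cfm_loss`) and the `sigma_min` default are hypothetical, chosen only for illustration.

```python
import numpy as np


def latent_mixup(spk_a, spk_b, lam):
    """Convex combination of two speaker feature vectors.

    Assumption: plain linear interpolation with weight lam in [0, 1];
    the paper's actual mixup rule may differ.
    """
    return lam * spk_a + (1.0 - lam) * spk_b


def cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Point on the optimal-transport CFM path and its target velocity.

    x0: noise sample, x1: data sample (e.g. a mel-spectrogram), t in [0, 1].
    Assumption: the standard OT path; sigma_min is an illustrative default.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def cfm_loss(pred_velocity, u_t):
    """Mean squared error between predicted and target velocity fields."""
    return float(np.mean((pred_velocity - u_t) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.standard_normal((80, 100))   # stand-in for a target mel-spectrogram
    x0 = rng.standard_normal(x1.shape)    # Gaussian noise
    x_t, u_t = cfm_pair(x0, x1, t=0.3)

    # A real decoder would predict the velocity from x_t, t, and its
    # conditioning (content features, mixed speaker features, voice prompt).
    spk = latent_mixup(rng.standard_normal(256), rng.standard_normal(256), lam=0.7)
    dummy_prediction = np.zeros_like(u_t)
    print(spk.shape, cfm_loss(dummy_prediction, u_t))
```

In a full system the zero prediction above would be replaced by the output of the DiT-based decoder, and the loss would be minimized over random draws of `t`, noise, and training utterances.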