ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization

Date: 2025-06-01
Citations: 0
Influential: 0
AI Summary
To address the large number of sampling steps and low inference efficiency of diffusion models in voice conversion, this paper proposes a zero-shot voice conversion method based on Rectified Flow (RF). It is the first work to introduce RF into voice conversion, modeling Mel-spectrogram generation as an ordinary differential equation (ODE) that follows the shortest path in latent space, thereby avoiding the long iterative denoising process of traditional DDPMs. The paper further designs a dynamic speaker embedding optimization mechanism jointly conditioned on phoneme content and fundamental frequency (F0), enhancing timbre fidelity and few-shot generalization. Experiments show a MOS improvement of over 0.8 in both zero-shot and low-resource settings and a 5 to 10× inference speedup over DDPM-based approaches, significantly outperforming existing diffusion-based voice conversion models.

Abstract
In recent years, diffusion-based generative models such as Denoising Diffusion Probabilistic Models (DDPM) have demonstrated remarkable performance in speech conversion. However, the advantages of these models come at the cost of a large number of sampling steps, which hinders their use in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well on small datasets and in zero-shot scenarios.
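The "most direct path" idea can be illustrated with a toy sketch. This is not the authors' implementation: ReFlow-VC would use a learned neural velocity field conditioned on content, pitch, and speaker features, whereas the hypothetical `make_toy_velocity` below hard-codes the ideal straight-line velocity toward one known target. The helper names, array shapes, and Euler solver are all assumptions chosen for illustration.

```python
import numpy as np


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a trained network: the exact velocity of a
    straight-line path from the current point x to a single known target."""
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


rng = np.random.default_rng(0)
target_mel = rng.normal(size=(80, 32))  # assumed shape: 80 Mel bins x 32 frames
noise = rng.normal(size=(80, 32))       # sample from the Gaussian prior

velocity = make_toy_velocity(target_mel)
one_step = euler_sample(velocity, noise, num_steps=1)
ten_step = euler_sample(velocity, noise, num_steps=10)

print(np.abs(one_step - target_mel).max())
print(np.abs(ten_step - target_mel).max())
```

Because the path in this toy setting is perfectly straight, a single Euler step already lands on the target, which is the intuition behind needing fewer sampling steps than a DDPM. A learned velocity field is only approximately straight, so a practical sampler would likely still use several steps.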
Problem

Research questions and friction points this paper is trying to address.

Reduces sampling steps in diffusion-based voice conversion models
Optimizes speaker features using content and pitch information
Enhances zero-shot voice conversion performance in small datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses rectified flow for direct path conversion
Optimizes speaker features with content and pitch
Performs well in zero-shot scenarios and on small datasets
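For reference, the general rectified-flow formulation can be written as follows. This is the standard form from the rectified-flow literature, not necessarily the exact objective used in ReFlow-VC; the conditioning symbol $c$ (standing for content, pitch, and speaker features) is an assumed notation.

```latex
% Straight-line interpolation between noise x_0 and a data sample x_1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1]

% ODE solved at inference time with the learned velocity field v_\theta
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t, c)

% Training objective: regress the constant velocity of the straight path
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t}
  \left\| (x_1 - x_0) - v_\theta(x_t, t, c) \right\|^2
```

Here $x_1$ would be a Mel-spectrogram drawn from the data distribution, so the regression target $x_1 - x_0$ is the direction of the direct path from the Gaussian sample to the data sample.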
Pengyu Ren
Department of Biomedical Engineering, The University of Texas at Austin
Molecular modeling, protein-ligand binding, free energy simulations, RNA structures, biomaterials
Wenhao Guan
Xiamen University
speech
Kaidi Wang
School of Informatics, Xiamen University, China
Peijie Chen
School of Informatics, Xiamen University, China
Q. Hong
School of Informatics, Xiamen University, China
Lin Li
School of Electronic Science and Engineering, Xiamen University, China