🤖 AI Summary
This work addresses the degradation of fine-grained expressiveness—such as vibrato and micro-prosody—in singing voice synthesis caused by the mismatch between prior and posterior latent distributions in conditional variational autoencoders (cVAEs). To resolve this, the authors introduce conditional flow matching (CFM) into the latent space for the first time, learning a continuous vector field along optimal transport paths to transform prior samples into posterior-aligned latent representations via ordinary differential equation trajectories. These refined latents are then employed for parallel waveform generation. The proposed approach effectively mitigates the prior-posterior mismatch while preserving efficient parallel decoding, significantly enhancing expressive quality. Experiments demonstrate consistent superiority over strong baselines on both Mandarin and Korean singing datasets, achieving lower mel-cepstral distortion and fundamental frequency error, as well as higher subjective perceptual scores on the Korean dataset.
📝 Abstract
Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer.
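The core mechanism in the abstract (a vector field learned along a linear optimal-transport path, then used to refine a prior latent by ODE integration) can be sketched in a few lines. This is a minimal illustrative toy, not the FM-Singer implementation: the latents are plain Python lists, and the "learned" field is replaced by the exact conditional target field for the linear path, which for CFM is the constant velocity u = x1 - x0.

```python
# Toy sketch of conditional flow matching (CFM) on a linear
# optimal-transport path. All names below are illustrative assumptions,
# not code from the FM-Singer repository.

def ot_path(x0, x1, t):
    """Linear OT interpolation: x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """For the linear OT path, the CFM regression target is constant: u = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_ode_refine(x0, velocity_fn, n_steps=10):
    """Refine a prior sample by integrating dx/dt = v(x, t) with Euler steps
    from t = 0 to t = 1, as done at inference before waveform generation."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Demo: pretend the network has learned the exact conditional field
# transporting a "prior" latent toward a "posterior" latent.
prior = [0.0, 1.0, -2.0]
posterior = [0.5, 0.2, 1.0]
u = target_velocity(prior, posterior)
refined = euler_ode_refine(prior, lambda x, t: u)
```

Because the target velocity is constant along the linear path, Euler integration here lands exactly on the posterior latent; in the actual method a neural vector field is regressed onto this target and the ODE is solved with the learned field instead.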