Training-Free Voice Conversion with Factorized Optimal Transport

📅 2025-06-11
🤖 AI Summary
This paper addresses zero-shot, any-to-any cross-lingual voice conversion that needs only 5 seconds of reference speech and no training. The method builds on the kNN-VC pipeline and its WavLM representations. Key contributions include: (1) replacing the kNN regression step with a factorized optimal transport map for speaker identity transfer; (2) deriving that map from the Monge–Kantorovich linear solution (MKL) and applying it within WavLM embedding subspaces, where the factorization compensates for non-uniform variance across embedding dimensions; and (3) enabling training-free, language-agnostic conversion through this feature transport. Experiments on LibriSpeech and FLEURS show significantly better content preservation and robustness to short reference audio than kNN-VC. The approach reaches performance comparable to FACodec, especially in cross-lingual voice conversion.
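
For readers unfamiliar with the term, the Monge–Kantorovich linear solution usually refers to the closed-form optimal transport map between two Gaussians. The block below states that standard result as background; the symbols (source statistics $\mu_s, \Sigma_s$ and target statistics $\mu_t, \Sigma_t$) are notation chosen here, and the exact way the paper splits WavLM dimensions into subspaces is not given in this summary.

```latex
% Standard closed-form OT map between N(\mu_s, \Sigma_s) and N(\mu_t, \Sigma_t)
% under squared Euclidean cost (notation assumed for illustration).
T(x) = \mu_t + A\,(x - \mu_s), \qquad
A = \Sigma_s^{-1/2}\left(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\right)^{1/2}\Sigma_s^{-1/2}
```

A factorized variant would presumably apply such a map independently inside each subspace rather than estimating one full covariance from a few seconds of audio.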

📝 Abstract
This paper introduces Factorized MKL-VC, a training-free modification of the kNN-VC pipeline. In contrast to the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich Linear solution. The factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on the LibriSpeech and FLEURS datasets show that MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in cross-lingual voice conversion.
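
The idea of a factorized linear transport map can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mkl_map`, `factorized_mkl_convert`), the regularizer `eps`, and the choice to split dimensions into contiguous, equally sized blocks are all hypothetical, and random arrays stand in for WavLM frame embeddings.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    mat = (mat + mat.T) / 2.0                 # remove numerical asymmetry
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)           # guard against tiny negatives
    return (vecs * np.sqrt(vals)) @ vecs.T


def mkl_map(src, tgt, eps=1e-6):
    """Fit the Gaussian OT map T(x) = mu_t + A (x - mu_s) from samples.

    src, tgt: arrays of shape (n_frames, dim). eps is an assumed
    diagonal regularizer that keeps the covariances invertible.
    """
    dim = src.shape[1]
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False).reshape(dim, dim) + eps * np.eye(dim)
    cov_t = np.cov(tgt, rowvar=False).reshape(dim, dim) + eps * np.eye(dim)
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    a = s_inv_half @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    return a, mu_s, mu_t


def factorized_mkl_convert(src, tgt, n_blocks=4):
    """Apply an independent MKL map inside each block of dimensions.

    Assumption: blocks are contiguous index ranges; the paper may
    choose its subspaces differently.
    """
    out = np.empty_like(src, dtype=float)
    for idx in np.array_split(np.arange(src.shape[1]), n_blocks):
        a, mu_s, mu_t = mkl_map(src[:, idx], tgt[:, idx])
        out[:, idx] = mu_t + (src[:, idx] - mu_s) @ a.T
    return out


# Synthetic stand-ins for source and reference embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(500, 8)) * np.linspace(0.5, 3.0, 8)
tgt = rng.normal(size=(400, 8)) * np.linspace(2.0, 0.5, 8) + 1.0
out = factorized_mkl_convert(src, tgt, n_blocks=4)
```

After conversion, each block of `out` has (up to the regularizer) the same mean and covariance as the corresponding block of `tgt`, which is the sense in which the map transfers target statistics onto source frames. Fitting small per-block covariances instead of one full covariance is one plausible reason such a scheme could stay stable with only a few seconds of reference audio.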
Problem

Research questions and friction points this paper is trying to address.

Achieves high-quality any-to-any cross-lingual voice conversion
Requires only 5 seconds of reference audio
Improves content preservation and robustness in voice conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factorized optimal transport for voice conversion
Training-free modification with 5-second reference
Cross-lingual performance comparable to FACodec