🤖 AI Summary
This paper addresses zero-shot, any-to-any cross-lingual voice conversion with only 5 seconds of reference speech and no training. The method, Factorized MKL-VC, is a training-free modification of the kNN-VC pipeline that operates on WavLM representations. Key contributions include: (1) replacing the kNN regression of kNN-VC with a factorized optimal transport map for speaker identity transfer; (2) applying the Monge–Kantorovich linear solution (MKL) within WavLM embedding subspaces, where the factorization compensates for non-uniform variance across embedding dimensions; and (3) enabling zero-shot, language-agnostic conversion through this closed-form feature transport. Experiments on LibriSpeech and FLEURS show that MKL-VC significantly improves content preservation and robustness to short reference audio relative to kNN-VC. The approach achieves performance comparable to FACodec, particularly in cross-lingual voice conversion, without relying on parallel data, speaker-specific fine-tuning, or language-pair specialization.
📝 Abstract
This paper introduces Factorized MKL-VC, a training-free modification of the kNN-VC pipeline. In contrast to the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich linear solution. The factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on the LibriSpeech and FLEURS datasets show that MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in the cross-lingual voice conversion domain.
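To make the idea of a factorized Monge-Kantorovich linear map concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each feature set is modeled as a Gaussian, for which the optimal transport map has the closed form T(x) = mu_t + A (x - mu_s) with A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2}, where S_s and S_t are the source and target covariances. The way dimensions are grouped here (contiguous blocks of `block_size`), the regularizer `eps`, and all function names are illustrative assumptions; the abstract does not specify how the subspaces are chosen.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def mkl_map(src, tgt, eps=1e-6):
    """Closed-form linear OT map between Gaussian fits of src and tgt.

    src, tgt: arrays of shape (frames, dims); frame counts may differ.
    Returns (A, mu_s, mu_t) such that T(x) = mu_t + A @ (x - mu_s).
    """
    d = src.shape[1]
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    # eps * I keeps the covariances invertible when few frames are available
    # (an assumption of this sketch, not a detail taken from the paper).
    cov_s = np.cov(src, rowvar=False).reshape(d, d) + eps * np.eye(d)
    cov_t = np.cov(tgt, rowvar=False).reshape(d, d) + eps * np.eye(d)
    s_half = _sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    A = s_half_inv @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    return A, mu_s, mu_t


def factorized_mkl_convert(src, tgt, block_size=2, eps=1e-6):
    """Apply an independent MKL map inside each block of dimensions.

    Hypothetical factorization: contiguous blocks of `block_size` dimensions.
    """
    out = np.empty(src.shape, dtype=float)
    for start in range(0, src.shape[1], block_size):
        sl = slice(start, start + block_size)
        A, mu_s, mu_t = mkl_map(src[:, sl], tgt[:, sl], eps)
        out[:, sl] = (src[:, sl] - mu_s) @ A.T + mu_t
    return out


# Toy usage with random stand-ins for source and reference embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(250, 8)) * np.linspace(0.5, 2.0, 8)
tgt = rng.normal(size=(300, 8)) * np.linspace(2.0, 0.5, 8) + 1.0
out = factorized_mkl_convert(src, tgt, block_size=2)
```

In this sketch the converted features take on the per-block mean and covariance of the reference features. Working in small blocks means each covariance is estimated from far more frames than it has dimensions, which is one plausible reading of why factorization helps when only a few seconds of reference audio are available.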