🤖 AI Summary
In one-shot voice conversion, existing paradigms based on self-supervised features (e.g., wav2vec 2.0) and K-means quantization decouple content from speaker identity but excessively compress fine-grained phonetic and prosodic detail, degrading speech intelligibility and prosodic fidelity; moreover, the quantization residuals are left unused. This paper proposes a one-shot VC method driven by a single target-speaker utterance that requires no pre-trained speaker verification models or external speaker embeddings. Its core contributions are: (1) a speaking-variation compensation mechanism that explicitly models and preserves phonetic and prosodic dynamics while training with only reconstruction losses; and (2) joint use of the quantized codebook output and the quantization residual to overcome the limited expressiveness of small codebooks. Evaluated on six metrics, the method delivers strong results, improving naturalness, speaker similarity, and prosodic consistency.
📝 Abstract
One-shot voice conversion (VC) enables conversion between any two speakers from only a single utterance of the target speaker. Existing methods often rely on complex architectures and pre-trained speaker verification (SV) models to improve the fidelity of converted speech. Recent works that apply K-means quantization (KQ) to self-supervised learning (SSL) features have proven capable of capturing content information from speech; however, they often struggle to preserve speaking variation, such as prosodic detail and phonetic variation, particularly with smaller codebooks. In this work, we propose a simple yet effective one-shot VC model that exploits the characteristics of SSL features and speech attributes. Our approach mitigates the loss of speaking variation, enabling high-fidelity voice conversion trained with only reconstruction losses and without external speaker embeddings. We evaluate our model on six metrics, with results highlighting the benefits of the speaking variation compensation method.
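To make the KQ-plus-residual idea concrete, the following is a minimal sketch (not the paper's code): frame-level SSL features are quantized against a K-means codebook, the nearest-centroid codes serve as a compact content representation, and the residual between the raw feature and its quantized version retains the fine-grained detail that quantization discards. The feature dimension, codebook size, and random features here are illustrative stand-ins for real wav2vec 2.0 frames.

```python
import numpy as np

def kmeans_fit(feats, k, iters=20, seed=0):
    """Plain Lloyd's K-means over frame features; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances from every frame to every centroid: (T, k)
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = feats[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def quantize(feats, centers):
    """Return content codes, quantized features, and the residual."""
    d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(1)            # discrete "content" tokens
    quantized = centers[codes]     # what a KQ-only pipeline would keep
    residual = feats - quantized   # speaking variation lost by KQ alone
    return codes, quantized, residual

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))   # stand-in for SSL frame features
centers = kmeans_fit(feats, k=8)     # small codebook, as in the paper's setting
codes, q, r = quantize(feats, centers)
assert np.allclose(q + r, feats)     # codes + residual reconstruct the frame exactly
```

With a small codebook, `quantized` alone is a coarse approximation of each frame, which is why KQ-only pipelines lose prosodic and phonetic nuance; feeding the residual back in (as the compensation mechanism does) restores that detail without an external speaker embedding.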