SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
In one-shot voice conversion, existing paradigms based on self-supervised features (e.g., wav2vec 2.0) and K-means quantization decouple content from speaker identity, but they excessively compress fine-grained phonetic and prosodic detail, degrading intelligibility and prosodic fidelity; moreover, quantization residuals remain underutilized. This paper proposes a method driven by a single target-speaker utterance that requires no external speaker verification models or pretrained speaker embeddings. Its core innovations are: (1) a speaking-variation compensation mechanism that explicitly models and preserves phonetic and prosodic dynamics, trained with only reconstruction losses; and (2) joint exploitation of the quantized codebook entries and the quantization residuals to overcome the expressiveness limits of small codebooks. Across six evaluation metrics, the method achieves state-of-the-art performance, significantly improving naturalness, speaker similarity, and prosodic consistency.

📝 Abstract
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance. Existing methods often rely on complex architectures and pre-trained speaker verification (SV) models to improve the fidelity of converted speech. Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech. However, they often struggle to preserve speaking variation, such as prosodic detail and phonetic variation, particularly with smaller codebooks. In this work, we propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes. Our approach addresses the issue of losing speaking variation, enabling high-fidelity voice conversion trained with only reconstruction losses, without requiring external speaker embeddings. We demonstrate the performance of our model across 6 evaluation metrics, with results highlighting the benefits of the speaking variation compensation method.
Problem

Research questions and friction points this paper is trying to address.

Quantization eliminates fine-grained phonetic and prosodic speech variations
Existing methods underutilize quantization residuals in voice conversion
Loss of intelligibility and prosody preservation in one-shot conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes quantization residuals for detail recovery
Employs linear projections for simple disentanglement
Leverages temporal properties of speech components
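The quantization-plus-residual idea in the bullets above can be sketched as follows. This is a minimal illustration with random vectors standing in for SSL features; the array shapes, codebook size, and the projection weights are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frame-level SSL features (T frames x D dims) and a
# K-means codebook learned offline (K entries x D dims).
T, D, K = 50, 16, 8
features = rng.normal(size=(T, D))
codebook = rng.normal(size=(K, D))

# K-means quantization: snap each frame to its nearest codebook entry.
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
ids = dists.argmin(axis=1)        # discrete content tokens
quantized = codebook[ids]         # coarse content, speaker detail stripped

# Quantization residual: the fine-grained detail the codebook discards,
# i.e. the prosodic/phonetic variation the paper compensates for.
residual = features - quantized

# A simple linear projection of the residual, in the spirit of the
# "linear projections for simple disentanglement" bullet (here the
# weights are random; in the paper they would be learned).
W = rng.normal(size=(D, D)) * 0.1
compensated = quantized + residual @ W

# Sanity check: quantized + residual reconstructs the input exactly.
assert np.allclose(quantized + residual, features)
```

The point of the sketch is that with a small codebook (K = 8 here) the quantized stream alone is a lossy content code, while the residual carries everything the codebook dropped, so feeding both downstream recovers expressiveness without enlarging the codebook.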
Youngjun Sim
Graduate School of Artificial Intelligence, POSTECH, Pohang, South Korea
Jinsung Yoon
Graduate School of Artificial Intelligence, POSTECH, Pohang, South Korea
Young-Joo Suh
POSTECH