MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three problems in existing streaming zero-shot voice conversion systems: degraded audio quality under small-chunk settings, high sensitivity to reference audio quality, and excessive latency. To overcome these limitations, the authors propose MeanVC 2, which introduces a future-receptive chunking (FRC) mechanism to stabilize conversion with small chunks and incorporates a universal timbre token encoder to improve robustness against low-quality references while preserving speaker similarity. Built upon a diffusion transformer architecture, the model uses cross-attention and a global speaker embedding to construct an efficient timbre representation. Experimental results show that MeanVC 2 achieves stable conversion with 40 ms chunks, reduces system latency from 211 ms to 110 ms, and significantly outperforms its predecessor in both audio quality and zero-shot speaker similarity.
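The cross-attention retrieval mentioned in the summary can be illustrated with a minimal single-head sketch in plain Python. This is one plausible reading, not the authors' implementation: the names `cross_attention`, `timbre_tokens`, and `content_queries` are hypothetical, the vectors are toy values, and the page does not say which side provides the queries or how the tokens are derived from the global speaker embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Each query produces a weighted average of `values`, with weights
    given by the softmax of its scaled dot products with `keys`.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Assumption: two toy timbre tokens, used as both keys and values.
timbre_tokens = [[1.0, 0.0], [0.0, 1.0]]
# Assumption: one toy query that strongly matches the first token.
content_queries = [[10.0, 0.0]]

retrieved = cross_attention(content_queries, timbre_tokens, timbre_tokens)
```

Because the output is a convex combination of the token vectors, a query that aligns with one token retrieves mostly that token's content.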

📝 Abstract

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
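The idea of scheduling past and future receptive fields per layer can be sketched as a chunk-level attention mask. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `SCHEDULE` values, and the chunk size in frames are all hypothetical, since the abstract does not give the actual schedule.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Build a boolean attention mask at chunk granularity.

    mask[i][j] is True when query frame i may attend to key frame j,
    i.e. when j's chunk lies within `past_chunks` before and
    `future_chunks` after i's chunk.
    """
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            allowed = query_chunk - past_chunks <= key_chunk <= query_chunk + future_chunks
            row.append(allowed)
        mask.append(row)
    return mask


def total_lookahead_chunks(schedule):
    """Future context accumulates across stacked layers, so the total
    lookahead is the sum of the per-layer future spans."""
    return sum(future for _past, future in schedule)


# Hypothetical per-layer (past_chunks, future_chunks) schedule.
# The values are illustrative only and are not taken from the paper.
SCHEDULE = [(4, 1), (4, 1), (4, 0), (4, 0)]

# Toy setting: 8 frames, 2 frames per chunk, one past and one future chunk.
mask = chunk_attention_mask(num_frames=8, chunk_size=2, past_chunks=1, future_chunks=1)
lookahead = total_lookahead_chunks(SCHEDULE)
```

In this sketch only the layers with a nonzero future span add to the lookahead, which shows how a per-layer schedule keeps the future context bounded.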

Problem

Research questions and friction points this paper is trying to address.

streaming voice conversion

zero-shot

low-latency

timbre robustness

chunk-wise processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

future-receptive chunking

universal timbre token encoder

streaming zero-shot voice conversion