AI Summary
This work addresses three weaknesses of existing streaming zero-shot voice conversion systems: degraded audio quality under small-chunk settings, high sensitivity to reference audio quality, and excessive latency. To overcome these limitations, the authors propose MeanVC 2, which introduces a future-receptive chunking (FRC) mechanism to stabilize conversion with small chunks and a universal timbre token encoder to improve robustness against low-quality references while preserving speaker similarity. Built on a diffusion transformer architecture, the model constructs its timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention. Experimental results show that MeanVC 2 achieves stable conversion with 40 ms chunks, reduces system latency from 211 ms to 110 ms, and significantly outperforms its predecessor in both audio quality and zero-shot speaker similarity.
Abstract
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
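The idea of scheduling bounded past and future receptive fields per decoder layer can be illustrated with a chunk-level attention mask. The following is a minimal sketch, not the authors' implementation: the function names, the frame and chunk counts, and the per-layer `schedule` values are all hypothetical, chosen only to show how a query frame is restricted to a window of neighboring chunks and how lookahead accumulates when layers are stacked.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Build a boolean mask where mask[i][j] is True when query frame i
    may attend to key frame j.

    A frame in chunk c sees chunks c - past_chunks .. c + future_chunks.
    (Hypothetical helper; the paper's exact masking rule is not given here.)
    """
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            row.append(
                query_chunk - past_chunks <= key_chunk <= query_chunk + future_chunks
            )
        mask.append(row)
    return mask


def total_lookahead_chunks(schedule):
    """Future context accumulates across stacked layers, so the overall
    lookahead is the sum of the per-layer future_chunks values."""
    return sum(future for _past, future in schedule)


# Assumed schedule: (past_chunks, future_chunks) for each decoder layer.
# Early layers look one chunk ahead; later layers are strictly causal.
schedule = [(4, 1), (4, 1), (4, 0), (4, 0)]

# One mask per layer for a toy sequence of 8 frames with 2 frames per chunk.
masks = [chunk_attention_mask(8, 2, past, future) for past, future in schedule]
lookahead = total_lookahead_chunks(schedule)
```

Under this assumed schedule the stack looks two chunks ahead in total, which at the 40 ms chunk size would correspond to 80 ms of algorithmic lookahead. That figure is purely illustrative and is not the configuration behind the reported 110 ms latency. Keeping future context in only some layers is one way to bound the total lookahead while still giving the model a view of upcoming frames.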