🤖 AI Summary
This work proposes a unified multilingual speech synthesis foundation model that supports high-fidelity synthesis, controllable prosody, zero-shot voice cloning, and natural-language-driven TTS. Built on a hierarchical diffusion-autoregressive hybrid architecture and a unified sequence representation, the model covers 30 languages and 9 Chinese dialects within a single backbone. It introduces an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, achieving implicit super-resolution without relying on external discrete speech tokenizers. With 2 billion parameters and over 2 million hours of training data, the model attains state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks, and an average word error rate of 1.68% on an internal 30-language test set. Model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license.
📝 Abstract
We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.
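For readers unfamiliar with the metric, the sketch below illustrates how a word error rate (WER) and an average over languages could be computed. It is not the evaluation pipeline used for VoxCPM2: the function names `word_error_rate` and `macro_average_wer`, the whitespace tokenization, and the choice of an unweighted mean across languages are all assumptions made for illustration. Whitespace tokenization in particular does not suit languages written without spaces, where a character-level error rate is commonly used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace (illustrative only).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def macro_average_wer(per_language_wer: dict[str, float]) -> float:
    """Unweighted mean of per-language WER values (an assumed averaging scheme)."""
    if not per_language_wer:
        raise ValueError("need at least one language")
    return sum(per_language_wer.values()) / len(per_language_wer)
```

Under this definition, a transcript that drops one word from a six-word reference scores 1/6, and an average WER of 1.68% corresponds to roughly 17 word errors per 1,000 reference words.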