DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

📅 2025-12-15
🤖 AI Summary
Traditional speech codecs tightly couple timbre and prosody, hindering fine-grained, cross-speaker controllable synthesis driven by large language models (LLMs). To address this, we propose DisCo-Speech, built on DisCodec, a three-factor disentangled codec that explicitly separates content, prosody, and timbre into distinct latent subspaces. Our method employs parallel encoders, a hybrid reconstruction loss, joint content-prosody tokenization, LM-guided prosody continuation, and target timbre injection in the decoder, jointly optimizing disentanglement fidelity and speech reconstruction quality. Experiments demonstrate that DisCo-Speech significantly outperforms all baselines in zero-shot prosody control and matches state-of-the-art (SOTA) voice cloning performance. Crucially, it enables fully disentangled, fine-grained, cross-speaker controllable synthesis while maintaining high-fidelity speech reconstruction.

📝 Abstract
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Problem

Research questions and friction points this paper is trying to address.

How to control prosody and timbre independently in speech synthesis
How to resolve codec-level entanglement between speech attributes for better controllability
How to enable zero-shot voice cloning and prosody transfer without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled speech codec separates content, prosody, timbre
Parallel encoders and hybrid losses achieve tri-factor disentanglement
LM predicts content-prosody tokens, decoder injects target timbre
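The pipeline described above can be sketched in code. The following is a toy NumPy illustration of the data flow only, under assumed dimensions: three parallel encoders factor speech features into content, prosody, and timbre subspaces; content and prosody are fused and quantized into unified tokens for LM prediction; and the decoder reconstructs speech conditioned on a target-speaker timbre vector. Random linear maps stand in for the learned networks, and all names, shapes, and the toy codebook are hypothetical, not the authors' implementation.

```python
import numpy as np

def encoder(x, dim_out, seed):
    # Hypothetical stand-in for a learned encoder: a fixed random linear map.
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], dim_out))
    return x @ w

def disco_codec_forward(wav_feats, target_timbre_feats):
    """Sketch of the two-stage DisCodec flow described in the abstract.

    Stage 1: parallel encoders factor speech into content, prosody,
    and timbre subspaces.
    Stage 2: content and prosody are fused into unified discrete tokens
    (LM-predictable); the decoder injects a *target* timbre vector,
    enabling cross-speaker control.
    """
    content = encoder(wav_feats, 64, seed=1)   # frame-level content latents
    prosody = encoder(wav_feats, 16, seed=2)   # frame-level prosody latents
    # Utterance-level timbre from a (possibly different-speaker) reference.
    timbre = encoder(target_timbre_feats, 32, seed=3).mean(axis=0)

    # Fuse content + prosody, then quantize against a toy codebook.
    fused = np.concatenate([content, prosody], axis=-1)        # (T, 80)
    codebook = np.random.default_rng(4).normal(size=(256, 80))
    dists = ((fused[:, None] - codebook[None]) ** 2).sum(-1)   # (T, 256)
    tokens = np.argmin(dists, axis=1)                          # discrete ids

    # Decoder: dequantized tokens + injected target timbre -> reconstruction.
    dec_in = np.concatenate(
        [codebook[tokens], np.tile(timbre, (len(tokens), 1))], axis=-1
    )
    recon = encoder(dec_in, wav_feats.shape[-1], seed=5)
    return tokens, recon

rng = np.random.default_rng(0)
T, D = 50, 40
wav = rng.normal(size=(T, D))    # source speech features
ref = rng.normal(size=(30, D))   # timbre reference from a target speaker
tokens, recon = disco_codec_forward(wav, ref)
print(tokens.shape, recon.shape)  # (50,) (50, 40)
```

The key design point this sketch mirrors is that the timbre vector never enters the token stream: the LM only ever predicts content-prosody tokens, so swapping the reference utterance changes the output voice without retraining.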
👥 Authors
Tao Li¹, Wengshuo Ge¹, Zhichao Wang¹, Zihao Cui¹, Yong Ma², Yingying Gao¹, Chao Deng¹, Shilei Zhang¹, Junlan Feng³

¹ China Mobile Nineverse Artificial Intelligence Technology (Beijing) Co., Ltd.; Nineverse Institute of Artificial Intelligence; The State Key Laboratory of Multimedia Information Processing, Peking University, Beijing, China
² Wuhan University (infrared image processing, remote sensing)
³ Chief Scientist at China Mobile Research (natural language, machine learning, speech processing, data mining)