🤖 AI Summary
Traditional speech codecs tightly couple timbre and prosody, hindering fine-grained, cross-speaker controllable synthesis driven by large language models (LLMs). To address this, we propose DisCo-Speech, a zero-shot controllable TTS framework built around DisCodec, a codec that explicitly disentangles content, prosody, and timbre into distinct latent subspaces. Our method employs parallel encoders with hybrid losses, fuses content and prosody into unified tokens suited to LM prediction, performs LLM-based prosodic continuation from a style prompt, and injects the target timbre at the decoder, jointly optimizing disentanglement fidelity and speech reconstruction quality. Experiments demonstrate that DisCo-Speech outperforms all baselines in zero-shot prosody control while matching state-of-the-art (SOTA) voice cloning performance. Crucially, it enables fully disentangled, fine-grained, cross-speaker controllable synthesis while maintaining high-fidelity speech reconstruction.
📝 Abstract
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
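The two-stage design can be caricatured with a toy numerical sketch. All dimensions, projection functions, and names below are illustrative placeholders (the paper's actual encoders are learned neural networks trained with hybrid losses); the sketch only shows the data flow: parallel encoders split speech into three factors, content and prosody are fused into unified tokens, and the decoder reconstructs speech with an injected (possibly swapped) timbre vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not from the paper.
T, D = 50, 64                # frames, input feature dim
D_C, D_P, D_S = 32, 16, 16   # content / prosody / timbre subspace sizes

def project(x, out_dim, seed):
    """Stand-in for a learned encoder/decoder: a fixed random projection."""
    w = np.random.default_rng(seed).standard_normal((x.shape[-1], out_dim))
    return x @ w / np.sqrt(x.shape[-1])

speech = rng.standard_normal((T, D))

# Stage 1: tri-factor disentanglement via parallel encoders.
content = project(speech, D_C, seed=1)            # frame-level content
prosody = project(speech, D_P, seed=2)            # frame-level prosody
timbre  = project(speech, D_S, seed=3).mean(0)    # utterance-level timbre

# Stage 2: fuse content and prosody into unified content-prosody tokens,
# the sequence an LM would predict autoregressively.
content_prosody = np.concatenate([content, prosody], axis=-1)

# Decoder: reconstruct speech from the tokens plus an injected timbre.
# Zero-shot voice cloning = swap in a different speaker's timbre vector.
def decode(tokens, timbre_vec, seed=4):
    timbre_tiled = np.tile(timbre_vec, (tokens.shape[0], 1))
    return project(np.concatenate([tokens, timbre_tiled], axis=-1), D, seed)

recon = decode(content_prosody, timbre)
print(recon.shape)  # (50, 64)
```

Swapping `timbre` for another speaker's vector changes only the voice identity path into the decoder, which is the mechanism the abstract describes for cross-speaker control.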