🤖 AI Summary
This work addresses the challenge of maintaining speaker identity consistency in multi-turn end-to-end spoken dialogue systems, a key limitation for personalized user experiences. The authors propose Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model, which leverages a speech tokenizer and a neural audio codec to construct high-fidelity discrete speech representations. Central to the approach is a novel 1:2 interleaved text-audio token scheduling mechanism that enables low-latency streaming generation. In multi-turn conversations, the method substantially improves speaker identity preservation, achieving a 10.96% relative improvement in speaker similarity over the human baseline. With a real-time factor (RTF) of 0.43, the system delivers high-quality personalized speech synthesis alongside robust reasoning and conversational capabilities.
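The 1:2 interleaved schedule can be pictured as merging the text and audio token streams so that each text token is immediately followed by two audio tokens, letting the decoder start emitting audio before the full text response is generated. The sketch below is illustrative only; the function name and token representation are assumptions, not Chroma's actual implementation.

```python
# Hypothetical sketch of a 1:2 interleaved text-audio token schedule.
# Token values and the tagging scheme are illustrative assumptions.

def interleave_1_to_2(text_tokens, audio_tokens):
    """Merge two token streams so each text token is followed by
    two audio tokens (the 1:2 ratio), enabling streaming output."""
    merged = []
    t, a = 0, 0
    while t < len(text_tokens) or a < len(audio_tokens):
        if t < len(text_tokens):
            merged.append(("text", text_tokens[t]))
            t += 1
        # Emit up to two audio tokens per text token.
        for _ in range(2):
            if a < len(audio_tokens):
                merged.append(("audio", audio_tokens[a]))
                a += 1
    return merged

schedule = interleave_1_to_2(["T0", "T1"], ["A0", "A1", "A2", "A3"])
# Order: T0, A0, A1, T1, A2, A3
```

Because audio tokens begin appearing after the first text token rather than after the whole utterance, first-audio latency is bounded by a few token steps rather than the full response length.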
📝 Abstract
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through a 1:2 interleaved text-audio token schedule that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B.