dots.tts Technical Report

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses ambiguous semantic structure, weak long-range coherence, and insufficient generation robustness in text-to-speech (TTS) synthesis by proposing dots.tts, a 2-billion-parameter continuous autoregressive TTS foundation model. The approach builds a semantically structured continuous speech latent space with a multi-objective AudioVAE, conditions the flow-matching head on the full generation history, and applies reward-free self-corrective post-training to that head, improving coherence, robustness, and acoustic quality. Trained on a large-scale multilingual corpus, the model achieves the best average performance on Seed-TTS-Eval, with a word error rate (WER) of 0.94% and a similarity score (SIM) of 81.0 on the zh test set. CFG-aware MeanFlow distillation enables efficient inference, with a first-packet latency of 54 ms in dual-streaming mode. Training and inference code and all checkpoints are released under the Apache 2.0 license.
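For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the reference length. The function name and the whitespace tokenization are assumptions made for illustration; this is not the Seed-TTS-Eval scoring script, and for Chinese text the tokens would typically be characters rather than whitespace-separated words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: real benchmarks also normalize text (case,
    punctuation, numerals) before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.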

📝 Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
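To make the flow-matching and classifier-free guidance (CFG) terminology above more concrete, here is a generic sketch of Euler sampling along a flow-matching ODE with CFG. It is not the dots.tts head: the velocity field is a hypothetical closed-form toy (`toy_velocity`) standing in for a learned network, there is no full-history conditioning, and all names and parameters are assumptions chosen for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * target, the
    velocity that carries x_t to target at t = 1 is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_with_cfg(cond, guidance_scale=1.0, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t = 0) to a sample (t = 1).

    At every step the conditional and unconditional velocities are mixed
    as v = v_uncond + guidance_scale * (v_cond - v_uncond).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]   # start from Gaussian noise
    null_cond = [0.0] * len(cond)             # toy "unconditional" target
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                            # t stays below 1, so 1 - t > 0
        v_cond = toy_velocity(x, t, cond)
        v_uncond = toy_velocity(x, t, null_cond)
        v = [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample_with_cfg([1.0, -2.0], guidance_scale=1.0))
    print(sample_with_cfg([1.0, -2.0], guidance_scale=2.0))
```

In this toy setting a guidance scale of 1.0 lands on the conditioning target and a larger scale overshoots it, which mirrors how CFG sharpens conditioning. Note that each step evaluates the velocity twice and the loop runs for `num_steps` iterations; a multi-step, two-pass sampler of this kind is the sort of inference cost that few-step distillation methods aim to reduce.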

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

continuous autoregressive modeling

long-range consistency

generation drift

acoustic quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous autoregressive TTS

AudioVAE

flow-matching