FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-speech (TTS) systems struggle to combine low latency with streaming capability because they rely on slow autoregressive generation and multi-step flow matching. This work proposes FlashTTS, a natively streaming TTS framework whose lagged multi-track architecture eliminates sentence-level buffering. By combining parallel multi-token prediction (MTP) with an X-pred mean flow matching decoder, FlashTTS achieves high-quality non-autoregressive acoustic generation in just two function evaluations (2-NFE). The method reduces first-packet latency to 325 ms, lower than the strong streaming baselines it is compared against, while preserving zero-shot voice cloning performance and cross-lingual intelligibility.
📝 Abstract
Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. These systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. They typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source, low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS reduces first-packet latency to 325 ms, substantially lower than that of robust streaming baselines, while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.
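To make the "two function evaluations" claim concrete, here is a minimal sketch of what a 2-NFE sampler with an average-velocity (mean flow) update can look like. It is not the FlashTTS implementation. The names `toy_x_pred`, `average_velocity`, and `sample_two_nfe` are hypothetical, the decoder is replaced by a toy function that always predicts one fixed clean frame, and the conversion from a clean-sample prediction to an average velocity, u = (z - x_hat) / t, is an assumption made for illustration. The update rule z_r = z_t - (t - r) * u follows the usual mean flow convention in which t = 1 is noise and t = 0 is data.

```python
from typing import Callable, List, Tuple

Vector = List[float]
XPredictor = Callable[[Vector, float, float], Vector]


def toy_x_pred(z: Vector, r: float, t: float) -> Vector:
    """Hypothetical stand-in for a trained decoder: always predicts one fixed clean frame."""
    return [0.25, -1.0, 0.5]


def average_velocity(x_pred: XPredictor, z: Vector, r: float, t: float) -> Vector:
    """Turn a clean-sample prediction into an average velocity.

    Assumption (not taken from the paper): u = (z - x_hat) / t.
    """
    x_hat = x_pred(z, r, t)
    return [(zi - xi) / t for zi, xi in zip(z, x_hat)]


def sample_two_nfe(x_pred: XPredictor, noise: Vector, t_mid: float = 0.5) -> Tuple[Vector, int]:
    """Move from noise (t = 1) to data (t = 0) in two jumps, counting model calls."""
    z = list(noise)
    nfe = 0
    for t, r in ((1.0, t_mid), (t_mid, 0.0)):
        u = average_velocity(x_pred, z, r, t)  # one model call per jump
        nfe += 1
        # Mean flow update: z_r = z_t - (t - r) * u(z_t, r, t)
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z, nfe


mel_frame, nfe = sample_two_nfe(toy_x_pred, [1.0, 2.0, -3.0])
print(nfe, mel_frame)
```

In this toy setting the sampler lands on the predicted frame after exactly two model calls. A real decoder would condition on speech tokens and speaker information, which the sketch leaves out.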
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
streaming
low latency
autoregressive prediction
flow matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming TTS
Multi-Token Prediction
Flow Matching
Low Latency
2-NFE Generation
👥 Authors
Hanke Xie
Northwestern Polytechnical University
Audio speech synthesis
Xiaming Ren
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Dake Guo
Northwestern Polytechnical University
Speech Processing, Speech Synthesis
Ruonan You
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Wenhao Li
Marshall School of Business, University of Southern California and NBER
Asset Pricing, Financial Intermediation, Macroeconomics
Jingbin Hu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Guobin Ma
Northwestern Polytechnical University
Huakang Chen
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China
Kejie Xu
Huawei Technologies Co., Ltd
Rui Huang
Huawei Technologies Co., Ltd
Weiguo Tan
Huawei Technologies Co., Ltd
Xianrong Wang
Huawei Technologies Co., Ltd
Lei Xi
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, Xi’an, China