Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

📅 2025-08-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high inference latency and poor streaming capability of autoregressive LLM-based text-to-speech (TTS) systems, this paper proposes Llasa+. Methodologically, it introduces three key components: (1) plug-and-play multi-token prediction (MTP) modules operating atop a frozen backbone network to enable parallel token generation; (2) a lightweight verification mechanism that reuses the frozen backbone to validate drafted tokens and ensure prediction reliability; and (3) a causal decoder enabling strictly streaming speech reconstruction. Trained exclusively on the LibriTTS dataset, Llasa+ achieves a 1.48× speedup without sacrificing speech quality, as measured by MOS, WER, and other objective metrics. The MTP-and-verification framework also transfers to other LLM-based TTS models, bridging the gap between efficient parallel generation and faithful, low-latency streaming synthesis.

๐Ÿ“ Abstract
Recent progress in text-to-speech (TTS) has achieved impressive naturalness and flexibility, especially with the development of large language model (LLM)-based approaches. However, existing autoregressive (AR) structures and large-scale models, such as Llasa, still face significant challenges in inference latency and streaming synthesis. To deal with the limitations, we introduce Llasa+, an accelerated and streaming TTS model built on Llasa. Specifically, to accelerate the generation process, we introduce two plug-and-play Multi-Token Prediction (MTP) modules following the frozen backbone. These modules allow the model to predict multiple tokens in one AR step. Additionally, to mitigate potential error propagation caused by inaccurate MTP, we design a novel verification algorithm that leverages the frozen backbone to validate the generated tokens, thus allowing Llasa+ to achieve speedup without sacrificing generation quality. Furthermore, we design a causal decoder that enables streaming speech reconstruction from tokens. Extensive experiments show that Llasa+ achieves a 1.48X speedup without sacrificing generation quality, despite being trained only on LibriTTS. Moreover, the MTP-and-verification framework can be applied to accelerate any LLM-based model. All codes and models are publicly available at https://github.com/ASLP-lab/LLaSA_Plus.
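The MTP-and-verification scheme described in the abstract follows a draft-then-verify pattern: the MTP modules propose several tokens per AR step, and the frozen backbone accepts a drafted token only if it matches what the backbone itself would have generated, so the output sequence is identical to plain autoregressive decoding. The sketch below illustrates that loop with toy stand-in models; `backbone_next` and `mtp_draft` are hypothetical functions for illustration, not Llasa+'s actual networks.

```python
# Toy draft-then-verify loop (speculative-decoding style).
# The "backbone" and "MTP drafter" are deterministic toy rules,
# not the real Llasa+ models.

def backbone_next(token: int) -> int:
    """Frozen backbone's greedy next-token prediction (toy rule)."""
    return (3 * token + 1) % 97

def mtp_draft(token: int, k: int = 3) -> list[int]:
    """MTP head drafts k tokens in one step; deliberately imperfect
    on the last draft to exercise the rejection path."""
    out, cur = [], token
    for i in range(k):
        cur = backbone_next(cur)
        if i == k - 1:
            cur = (cur + 1) % 97  # injected drafting error
        out.append(cur)
    return out

def generate(start: int, n_tokens: int, k: int = 3) -> list[int]:
    """Accept drafted tokens only while they match the backbone's
    own greedy choice; at the first mismatch, take the backbone's
    token instead. Output is identical to pure AR decoding."""
    seq = [start]
    while len(seq) < n_tokens + 1:
        for tok in mtp_draft(seq[-1], k):
            if tok == backbone_next(seq[-1]):
                seq.append(tok)                      # verified: keep
            else:
                seq.append(backbone_next(seq[-1]))   # reject, correct
                break
            if len(seq) == n_tokens + 1:
                break
    return seq[1:]
```

Because every accepted token equals the backbone's greedy choice, speedup comes purely from how many drafts survive verification per AR step, while output quality is unchanged; this is the "free lunch" the title refers to.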
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in LLM-based TTS models
Enabling streaming speech synthesis with autoregressive structures
Mitigating error propagation in multi-token prediction approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play Multi-Token Prediction modules
Verification algorithm for error mitigation
Causal decoder for streaming synthesis
Authors
Wenjie Tian, Northwestern Polytechnical University (speech generation)
Zhen Ye, Hong Kong University of Science and Technology, Hong Kong, China
Xinfa Zhu, Northwestern Polytechnical University (speech generation)
Wei Xue, Hong Kong University of Science and Technology, Hong Kong, China
Hanke Xie, Northwestern Polytechnical University (audio speech synthesis)
Lei Xie, Northwestern Polytechnical University, Xi'an, China