🤖 AI Summary
This paper addresses the challenge of achieving ultra-low latency and high speech quality simultaneously in real-time text-to-speech (TTS) synthesis. We propose VoXtream, a fully autoregressive, zero-shot streaming TTS system built on a three-stage incremental Transformer architecture: (1) an incremental phoneme transformer for phoneme sequence modeling, (2) a temporal transformer jointly predicting semantic tokens and phoneme durations, and (3) a depth transformer generating acoustic tokens. To enable token-by-token streaming that begins speaking from the first word, we introduce a monotonic alignment scheme and a dynamic look-ahead mechanism that does not delay onset. Despite training on a mid-scale 9k-hour corpus, our system achieves an initial latency of 102 ms on GPU, to our knowledge the lowest among publicly available streaming TTS systems. In both output-streaming and full-streaming modes it matches or surpasses larger baseline models on naturalness and MOS, demonstrating that high-quality streaming TTS is feasible even with limited training data.
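The core streaming idea, a monotonic alignment cursor combined with a dynamic look-ahead that uses whatever future phonemes have already arrived rather than waiting for a full window, can be illustrated with a toy sketch. This is not the authors' implementation: the class, the duration rule, and the token strings below are all invented stand-ins for the real transformer stages.

```python
class StreamingTTSSketch:
    """Toy model of monotonic, look-ahead-bounded streaming synthesis.

    Hypothetical stand-in for VoXtream's pipeline: real semantic/duration
    prediction and acoustic token generation are replaced by trivial rules.
    """

    def __init__(self, max_lookahead: int = 2):
        self.max_lookahead = max_lookahead  # max future phonemes visible
        self.buffer = []                    # phonemes received so far
        self.pointer = 0                    # monotonic alignment cursor

    def feed(self, phoneme: str) -> list[str]:
        """Accept one incoming phoneme, emit any tokens now producible."""
        self.buffer.append(phoneme)
        tokens = []
        # Dynamic look-ahead: the context window is capped at max_lookahead
        # but shrinks to whatever is available, so the very first phoneme
        # is synthesized immediately instead of waiting for the window to fill.
        while self.pointer < len(self.buffer):
            ctx = self.buffer[self.pointer : self.pointer + 1 + self.max_lookahead]
            # Stand-in for the temporal transformer's duration prediction.
            duration = 2 if ctx[0] in "aeiou" else 1
            # Stand-in for the depth transformer's acoustic tokens.
            tokens += [f"{ctx[0]}#{frame}" for frame in range(duration)]
            self.pointer += 1  # monotonic: a phoneme is never revisited
        return tokens
```

Feeding phonemes one at a time shows the onset property: tokens for the first phoneme are emitted as soon as it arrives, with no startup buffering.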
📝 Abstract
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.