🤖 AI Summary
This paper addresses the challenge of achieving ultra-low latency and high speech quality simultaneously in real-time text-to-speech (TTS) synthesis. We propose VoXtream, a fully autoregressive, zero-shot streaming TTS system built on a three-stage incremental Transformer architecture: (1) an incremental phoneme transformer for phoneme sequence modeling, (2) a temporal transformer jointly predicting semantic tokens and phoneme durations, and (3) a depth transformer generating acoustic tokens. To enable token-by-token streaming that begins speaking from the first word, we introduce a monotonic alignment scheme and a dynamic look-ahead mechanism that does not delay onset. Despite training on a mid-scale 9k-hour corpus, our system achieves an initial latency of 102 ms on GPU, to our knowledge the lowest among publicly available streaming TTS systems. In both output-streaming and full-streaming modes it matches or surpasses larger baseline models on naturalness and MOS, demonstrating that high-quality streaming TTS is feasible even with limited training data.
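The core streaming idea, a monotonic alignment cursor combined with a dynamic look-ahead that uses whatever future phonemes have already arrived rather than waiting for a full window, can be illustrated with a toy sketch. This is not the authors' implementation: the class, the duration rule, and the token strings below are all invented stand-ins for the real transformer stages.

```python
class StreamingTTSSketch:
    """Toy model of monotonic, look-ahead-bounded streaming synthesis.

    Hypothetical stand-in for VoXtream's pipeline: real semantic/duration
    prediction and acoustic token generation are replaced by trivial rules.
    """

    def __init__(self, max_lookahead: int = 2):
        self.max_lookahead = max_lookahead  # max future phonemes visible
        self.buffer = []                    # phonemes received so far
        self.pointer = 0                    # monotonic alignment cursor

    def feed(self, phoneme: str) -> list[str]:
        """Accept one incoming phoneme, emit any tokens now producible."""
        self.buffer.append(phoneme)
        tokens = []
        # Dynamic look-ahead: the context window is capped at max_lookahead
        # but shrinks to whatever is available, so the very first phoneme
        # is synthesized immediately instead of waiting for the window to fill.
        while self.pointer < len(self.buffer):
            ctx = self.buffer[self.pointer : self.pointer + 1 + self.max_lookahead]
            # Stand-in for the temporal transformer's duration prediction.
            duration = 2 if ctx[0] in "aeiou" else 1
            # Stand-in for the depth transformer's acoustic tokens.
            tokens += [f"{ctx[0]}#{frame}" for frame in range(duration)]
            self.pointer += 1  # monotonic: a phoneme is never revisited
        return tokens
```

Feeding phonemes one at a time shows the onset property: tokens for the first phoneme are emitted as soon as it arrives, with no startup buffering.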
📝 Abstract
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.