TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of autoregressive text-to-speech (AR-TTS) systems caused by excessively long audio token sequences, which lead to high inference latency and substantial key-value (KV) cache memory consumption. To mitigate this, the authors propose a patch-based autoregressive modeling framework that compresses consecutive audio tokens into compact latent patches, performs causal modeling at the patch level, and reconstructs fine-grained speech tokens via a lightweight extractor. Notably, this is the first effort to integrate patch-level causal modeling into a pretrained AR-TTS system. By freezing the backbone network and adding LoRA adaptation alongside a speaker-conditioned extractor, the method reduces inference overhead without replacing existing components. Experiments demonstrate that with a patch size of 4, inference speed improves by 1.8× and global KV-cache memory usage decreases by up to 75%, while maintaining high-quality speech synthesis.
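The 75% figure matches simple sequence-length arithmetic: if the backbone caches one entry per patch instead of one per token, a patch size of 4 leaves roughly a quarter of the speech-side entries. The sketch below only illustrates that arithmetic. The function names are hypothetical, and it assumes the cache grows linearly with the number of positions the backbone processes; it is not the paper's measurement procedure.

```python
import math


def patch_sequence_length(num_tokens: int, patch_size: int) -> int:
    """Number of patch positions needed to cover num_tokens codec tokens."""
    if num_tokens < 1 or patch_size < 1:
        raise ValueError("num_tokens and patch_size must be positive")
    # A trailing partial patch still occupies one backbone position.
    return math.ceil(num_tokens / patch_size)


def kv_cache_reduction(num_tokens: int, patch_size: int) -> float:
    """Fraction of speech-side KV-cache entries saved by patch-level modeling.

    Assumption: one cache entry per backbone position, so the saving
    depends only on how much shorter the patch sequence is.
    """
    patches = patch_sequence_length(num_tokens, patch_size)
    return 1.0 - patches / num_tokens


if __name__ == "__main__":
    print(patch_sequence_length(1000, 4))  # 250 patch positions
    print(kv_cache_reduction(1000, 4))     # 0.75
```

Under this assumption the saving approaches 1 - 1/P for patch size P. Any part of the sequence that is not patched, such as a text prompt, would lower the overall figure, which is one plausible reading of the "up to" qualifier.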

📝 Abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
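The grouping step can be pictured with a minimal sketch. It assumes a single stream of codec token ids, a hypothetical `pad_id` for the final partial patch, and mean pooling over token embeddings as a stand-in for the compressor. The abstract describes the compressor only as lightweight, so the pooling choice here is an illustration, not the authors' design.

```python
from typing import List, Sequence


def group_into_patches(
    token_ids: Sequence[int], patch_size: int, pad_id: int = 0
) -> List[List[int]]:
    """Split a codec token sequence into consecutive fixed-size patches.

    pad_id is a hypothetical padding token used to fill the last patch.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be positive")
    tokens = list(token_ids)
    padding = (-len(tokens)) % patch_size  # tokens missing from the last patch
    tokens.extend([pad_id] * padding)
    return [tokens[i:i + patch_size] for i in range(0, len(tokens), patch_size)]


def mean_pool_patch(
    patch: Sequence[int], embedding_table: Sequence[Sequence[float]]
) -> List[float]:
    """Stand-in compressor: average the embeddings of the tokens in a patch.

    This only shows the shape change (patch_size tokens -> one latent
    vector); a learned compressor would replace it.
    """
    dim = len(embedding_table[0])
    vectors = [embedding_table[token] for token in patch]
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


if __name__ == "__main__":
    # Toy embedding table: token i maps to [i, 2*i].
    table = [[float(i), float(2 * i)] for i in range(8)]
    patches = group_into_patches([1, 2, 3, 4, 5, 6], patch_size=4)
    latents = [mean_pool_patch(p, table) for p in patches]
    print(patches)  # two patches, the second padded with pad_id
    print(latents)  # one latent vector per patch
```

The backbone would then consume one latent vector per patch, and an extractor would map each predicted patch back to its individual tokens. That reconstruction step is omitted here because the abstract gives no detail beyond speaker conditioning.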

Problem

Research questions and friction points this paper is trying to address.

autoregressive TTS

audio token compression

inference efficiency

KV cache

codec-based speech synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

patch-based autoregressive modeling

audio token compression

codec-based TTS