🤖 AI Summary
This work addresses the inefficiency in autoregressive text-to-speech (AR-TTS) systems caused by excessively long audio token sequences, which lead to high inference latency and substantial key-value (KV) cache memory consumption. To mitigate this, the authors propose a patch-based autoregressive modeling framework that compresses consecutive audio tokens into compact latent patches, performs causal modeling at the patch level, and reconstructs fine-grained speech tokens via a lightweight extractor. Notably, this is the first effort to integrate patch-level causal modeling into a pretrained AR-TTS system. By freezing the backbone network and incorporating LoRA adaptation alongside a speaker-conditioned extractor, the method reduces inference overhead without replacing existing components. Experiments demonstrate that with a patch size of 4, inference is 1.8× faster than the baseline AR-TTS model and KV-cache memory usage decreases by up to 75%, while maintaining high-quality speech synthesis.
📝 Abstract
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8× inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
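
The relationship between patch size and the reported KV-cache reduction can be illustrated with a small sketch. This is not the authors' implementation: the function names `group_into_patches` and `kv_cache_saving` are hypothetical, the learned compressor is replaced by plain list slicing, and only speech positions are counted. Text-prompt positions and the extractor's own cost are ignored, so the sketch shows only the idealized saving of 1 - 1/P on the speech portion for patch size P.

```python
import math


def group_into_patches(tokens, patch_size):
    """Split a flat list of codec token ids into consecutive patches.

    Stand-in for the learned compressor: it only groups tokens and
    does not produce latent vectors. The last patch may be shorter.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be at least 1")
    return [tokens[i:i + patch_size] for i in range(0, len(tokens), patch_size)]


def kv_cache_saving(num_tokens, patch_size):
    """Fraction of backbone KV-cache positions saved on the speech portion.

    Assumes one cached position per patch instead of one per token.
    """
    num_patches = math.ceil(num_tokens / patch_size)
    return 1.0 - num_patches / num_tokens


# Illustrative values only: 12 dummy token ids, patch size 4.
tokens = list(range(12))
patches = group_into_patches(tokens, 4)
print(len(patches))            # 3 backbone positions instead of 12
print(kv_cache_saving(12, 4))  # 0.75
```

Under these assumptions, a patch size of 4 leaves the backbone with a quarter of the speech positions to cache, which is consistent with the "up to 75%" figure. When the token count is not a multiple of the patch size, the final partial patch makes the saving slightly smaller.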