Multi-head Temporal Latent Attention

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer self-attention faces memory and latency bottlenecks in long-sequence inference because the key-value (KV) cache grows linearly with sequence length. To address this, the paper proposes Multi-head Temporal Latent Attention (MTLA), an attention mechanism that compresses the KV cache along the temporal dimension. MTLA uses a hyper-network to dynamically merge temporally adjacent KV vectors on top of a low-rank latent representation, and introduces a stride-aware causal mask so that parallel training remains consistent with incremental inference. The result is a KV cache that grows far more slowly with sequence length than in standard attention. Across speech translation, speech recognition, speech understanding and text summarisation, MTLA matches standard multi-head attention (MHA) in performance; on English–German speech translation it delivers a 5.3× inference speedup and an 8.3× reduction in GPU memory without sacrificing translation quality.
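To make the memory argument concrete, the back-of-envelope comparison below sizes a standard per-head MHA cache against a temporally compressed low-rank latent cache. All shapes and numbers (`latent_dim`, `stride`, layer/head counts) are hypothetical illustrations, not the paper's actual configuration:

```python
# Illustrative KV-cache sizing: standard MHA vs. a temporally compressed
# latent cache in the spirit of MTLA. Numbers are made up for illustration.

def mha_cache_elems(T, layers, heads, head_dim):
    # MHA stores keys AND values for every timestep, head, and layer.
    return 2 * T * layers * heads * head_dim

def mtla_cache_elems(T, layers, latent_dim, stride):
    # Temporal-latent cache: one low-rank latent vector per chunk of
    # `stride` adjacent timesteps, per layer.
    chunks = -(-T // stride)  # ceil(T / stride)
    return chunks * layers * latent_dim

# Hypothetical decoder: 2048 tokens, 24 layers, 16 heads of dim 64,
# versus a 512-dim latent merged with temporal stride 2.
T, layers, heads, head_dim = 2048, 24, 16, 64
latent_dim, stride = 512, 2

mha = mha_cache_elems(T, layers, heads, head_dim)
mtla = mtla_cache_elems(T, layers, latent_dim, stride)
print(mha, mtla, mha / mtla)  # → 100663296 12582912 8.0
```

With these toy numbers the compressed cache is 8× smaller, and the ratio keeps improving as the stride grows, since the cache scales with `T / stride` rather than `T`.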

📝 Abstract
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on an English–German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
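The stride-aware causal mask is the piece that reconciles parallel training with the compressed cache seen at inference. The sketch below is one plausible reading of that idea, not the paper's exact formulation: during training, each timestep `u` holds an incremental merge state of its chunk, and query `t` may attend only to its own current state plus the final state of every earlier, fully merged chunk:

```python
import numpy as np

def stride_aware_causal_mask(T, s):
    """Boolean mask over incremental merge states (a sketch of the idea).

    Query t may attend to state u iff:
      * u == t (the current, possibly partial, merge state of t's chunk), or
      * u is the last timestep of a chunk that finished strictly before
        t's chunk: (u + 1) % s == 0 and u // s < t // s.
    """
    t = np.arange(T)[:, None]
    u = np.arange(T)[None, :]
    own_state = u == t
    finished_chunk = ((u + 1) % s == 0) & (u // s < t // s)
    return own_state | finished_chunk

m = stride_aware_causal_mask(6, 2)
# Each row attends to exactly as many states as the inference-time cache
# holds at that step: ceil((t + 1) / s) slots.
print(m.sum(axis=1))  # → [1 1 2 2 3 3]
```

The useful consistency check is in the final comment: row `t` of the mask selects exactly `ceil((t + 1) / s)` states, matching the number of compressed cache slots an incremental decoder would hold at step `t`.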
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache size in self-attention for efficiency
Compresses KV cache along temporal dimension dynamically
Ensures training-inference consistency with stride-aware masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses KV cache into low-rank latent space
Dynamically merges temporally adjacent KV cache vectors
Uses stride-aware causal mask for efficient training
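The bullets above can be sketched as a decoding-time cache that merges temporally adjacent latent vectors within each chunk. This is a minimal toy, not the paper's update rule: `TemporalLatentCache` is a hypothetical class name, and the "hyper-network" here is reduced to a learned per-position merge weight with random stand-in parameters, purely to show the mechanics:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TemporalLatentCache:
    """Toy decoding-time cache that merges adjacent latent KV vectors
    within chunks of length `stride` (a sketch, not MTLA's exact rule)."""

    def __init__(self, stride, rng):
        self.stride = stride
        self.slots = []  # one merged latent vector per temporal chunk
        self.t = 0       # timesteps seen so far
        # Stand-in for the hyper-network: one merge weight per
        # within-chunk position (random here; learned in the real model).
        self.w = rng.normal(size=stride)

    def append(self, kv):
        pos = self.t % self.stride
        if pos == 0:
            self.slots.append(kv)  # first vector of a new chunk
        else:
            # Dynamically merge the new vector into the current slot.
            alpha = sigmoid(self.w[pos])
            self.slots[-1] = alpha * self.slots[-1] + (1.0 - alpha) * kv
        self.t += 1

rng = np.random.default_rng(0)
cache = TemporalLatentCache(stride=2, rng=rng)
for _ in range(7):
    cache.append(rng.normal(size=4))
print(len(cache.slots))  # → 4 slots for 7 timesteps at stride 2
```

In the real method the merge weights are produced dynamically by a hyper-network rather than being fixed per position; the point of the sketch is only that the cache holds `ceil(T / stride)` slots instead of `T`.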
Keqi Deng
University of Cambridge
Speech processing · Translation · Large language model
Phil Woodland
Department of Engineering, University of Cambridge