Towards Tight Bounds for Streaming Attention

📅 2026-06-05

🤖 AI Summary

This work addresses the space complexity of KV cache compression, formalized as the streaming attention approximation problem, and establishes nearly tight upper and lower bounds, narrowing the gap left open by prior work. On the algorithmic side, the authors combine three methods for kernel density estimation: discrepancy-based coreset constructions, the polynomial method, and space partitioning. On the lower bound side, they introduce a new technique for using the INDEX problem with a large amount of side information, which they hope will prove useful in other high-dimensional geometric estimation problems.

📝 Abstract

The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al. (COLT'25) and Kochetkova et al. (NeurIPS'25) formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information-theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but the lower bound does not get correspondingly stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten'24), the polynomial method (e.g., Greengard-Rokhlin'87, Alman-Song'23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten'17, Charikar-Kapralov-Nouri-Siminelakis'20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information, which we hope will prove useful in other high-dimensional geometric estimation problems.
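
To make the linear space cost mentioned in the abstract concrete, here is a minimal sketch of exact (uncompressed) streaming attention. It is only the baseline that KV cache compression tries to beat, not the algorithm from the paper. The class name `KVCache`, its methods, and the unscaled softmax over dot products are assumptions chosen for illustration; the paper's exact formalization may differ.

```python
import math


class KVCache:
    """Hypothetical baseline: exact streaming attention that keeps every token.

    Memory grows linearly with the number of tokens seen, which is the
    cost that KV cache compression aims to reduce.
    """

    def __init__(self):
        self.keys = []    # one key vector per token seen so far
        self.values = []  # one value vector per token seen so far

    def append(self, key, value):
        """Store the (key, value) pair of a newly arrived token."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def attend(self, query):
        """Return the softmax-weighted average of all stored values.

        Assumption: weights are softmax(<query, key_i>) with no scaling factor.
        """
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)  # subtract the max score for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * v[j] for w, v in zip(weights, self.values)) / total
            for j in range(dim)
        ]

    def stored_floats(self):
        """Number of floats currently held: linear in the stream length."""
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


cache = KVCache()
cache.append([1.0, 0.0], [1.0])
cache.append([0.0, 1.0], [3.0])

# A zero query scores both keys equally, so the output is the plain average.
out = cache.attend([0.0, 0.0])
print(out, cache.stored_floats())
```

Each call to `attend` scans every stored pair, so answering one query per token over a stream of n tokens costs time quadratic in n overall, while `stored_floats` grows linearly in n. A compressed cache would answer the same queries approximately while storing far fewer entries.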

Problem

Research questions and friction points this paper is trying to address.

streaming attention

KV cache compression

space complexity

attention approximation

transformer

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming attention

KV cache compression

space complexity