Limits of KV Cache Compression for Tensor Attention based Autoregressive Transformers

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
KV caching in autoregressive Transformer inference induces severe memory bottlenecks. Method: We introduce communication complexity reduction techniques into attention analysis to derive tight space-complexity lower bounds for KV cache compression under tensor-based attention, separately for high-dimensional (d = Ω(log n)) and low-dimensional (d = o(log n)) settings; we further integrate information-theoretic arguments with tensor-structured modeling to characterize the fundamental trade-off between compression ratio and representational capacity. Contribution/Results: We establish the first rigorous theoretical lower bound on KV cache compressibility, proving that—under typical parameter configurations—the KV cache is inherently incompressible beyond negligible factors. This work provides the first theoretical benchmark for memory-efficient Transformer design and introduces a novel design paradigm grounded in formal complexity analysis.

Technology Category

Application Category

📝 Abstract
The key-value (KV) cache in autoregressive transformers presents a significant bottleneck during inference, which restricts the context length capabilities of large language models (LLMs). While previous work analyzes the fundamental space complexity barriers in standard attention mechanism [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a novel reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = Omega(log n)$. In the low dimensional regime where $d = o(log n)$, we analyze the theoretical bounds of the space complexity as well. Overall, our work provides a theoretical foundation for us to understand the compression-expressivity tradeoff in tensor attention mechanisms and offers more perspectives in developing more memory-efficient transformer architectures.
Problem

Research questions and friction points this paper is trying to address.

Analyzes space complexity in tensor attention mechanisms.
Explores memory lower bounds for tensor-structured attention.
Investigates compression-expressivity tradeoff in transformers.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizes space complexity barriers to tensor attention
Uses communication complexity for memory lower bounds
Analyzes compression-expressivity tradeoff in tensor attention
🔎 Similar Papers
No similar papers found.
Y
Yifang Chen
The University of Chicago
X
Xiaoyu Li
Stevens Institute of Technology
Yingyu Liang
Yingyu Liang
The University of Hong Kong
machine learning
Zhenmei Shi
Zhenmei Shi
Senior Research Scientist at MongoDB + Voyage AI; PhD from University of Wisconsin–Madison
Deep LearningMachine LearningArtificial Intelligence
Z
Zhao Song
The Simons Institute for the Theory of Computing at UC Berkeley
Y
Yu Tian
Independent Researcher