🤖 AI Summary
This work addresses the quadratic growth in the computational and memory costs of standard self-attention with respect to context length, a key bottleneck in scaling large language models. The authors propose a novel approach based on symmetry-aware Taylor expansion, decomposing self-attention into chains of symmetric tensor products. By mapping queries and keys through feed-forward transformations into a minimal polynomial-kernel feature basis, the method achieves constant per-token computation and memory, regardless of context length, with a fixed cost that is inversely related to the attention head dimension. The formulation computes self-attention to arbitrary precision while cutting resource consumption by several orders of magnitude, enabling unbounded token generation and substantially lowering the infrastructure and energy demands of large models; the mathematical techniques introduced are also of independent theoretical interest.
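As a sketch of the kind of identity that underlies such constant-cost formulations (a reconstruction from the summary above, not the paper's exact derivation), the softmax kernel admits a Taylor expansion into tensor-product inner products:

$$\exp(q^\top k) \;=\; \sum_{p=0}^{\infty} \frac{(q^\top k)^p}{p!} \;=\; \sum_{p=0}^{\infty} \frac{\left\langle q^{\otimes p},\, k^{\otimes p} \right\rangle}{p!}.$$

Writing $\phi(x)$ for the concatenation of the scaled tensor powers $x^{\otimes p}/\sqrt{p!}$, the causal attention output becomes

$$y_t \;=\; \frac{\sum_{s \le t} e^{q_t^\top k_s}\, v_s}{\sum_{s \le t} e^{q_t^\top k_s}} \;\approx\; \frac{\phi(q_t)^\top \sum_{s \le t} \phi(k_s)\, v_s^\top}{\phi(q_t)^\top \sum_{s \le t} \phi(k_s)},$$

where the two running sums are updated once per token, independent of context length. Because each $x^{\otimes p}$ is a symmetric tensor, it can be represented in a basis of dimension $\binom{d+p-1}{p}$ rather than the full $d^p$, which is where the minimal polynomial-kernel feature basis and the symmetry of the tensor-product chains enter.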
📝 Abstract
The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that grow with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, the fixed cost is inversely proportional to head size, enabling the use of more attention heads per token than would otherwise be feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
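To make the constant-per-token claim concrete, here is a minimal runnable sketch of Taylor-expansion attention in NumPy. It truncates the series at a fixed order and, for simplicity, uses the full flattened tensor-product basis rather than the paper's minimal symmetric basis; the function names and the truncation order are illustrative assumptions, not the authors' implementation. Each token updates two running sums of fixed size, so per-token cost does not grow with context length.

```python
import numpy as np
from math import factorial

def taylor_feature(x, order):
    """Map x to the concatenation of x^{(tensor p)} / sqrt(p!) for p = 0..order,
    so that phi(q) @ phi(k) = sum_p (q @ k)^p / p!, a truncated Taylor
    series for exp(q @ k). Uses the full d^p basis (not the minimal
    symmetric basis described in the paper) for clarity."""
    feats, t = [np.ones(1)], np.ones(1)
    for p in range(1, order + 1):
        t = np.outer(t, x).ravel()          # flattened p-fold tensor power
        feats.append(t / np.sqrt(factorial(p)))
    return np.concatenate(feats)

def streaming_attention(Q, K, V, order=6):
    """Causal softmax attention approximated with constant cost per token:
    two running sums (S, z) of fixed size are updated once per token."""
    f_dim = taylor_feature(K[0], order).size
    S = np.zeros((f_dim, V.shape[1]))       # running sum of phi(k_s) v_s^T
    z = np.zeros(f_dim)                     # running sum of phi(k_s)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = taylor_feature(k, order)
        S += np.outer(fk, v)
        z += fk
        fq = taylor_feature(q, order)
        out.append((fq @ S) / (fq @ z))     # numerator / softmax normalizer
    return np.array(out)

def exact_attention(Q, K, V):
    """Reference: standard causal softmax attention, O(T) per token."""
    out = []
    for t in range(len(Q)):
        w = np.exp(K[: t + 1] @ Q[t])
        out.append((w / w.sum()) @ V[: t + 1])
    return np.array(out)
```

With small query-key dot products, a modest truncation order already matches the exact softmax result to high precision, which is the sense in which the expansion is "computable to arbitrary precision": raising the order tightens the approximation at a cost that remains independent of context length.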