Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

📅 2026-05-31

🤖 AI Summary

This work addresses the dilution of relevant information in linear attention on long-context tasks. The dilution arises in the read phase, where every stored key contributes additively to the output, so useful targets are swamped by the bulk of stored vectors. The authors propose the Curvature-Conditioned Query (CCQ), which contracts the query at read time using a local quadratic model of the softmax log-partition function. Specifically, a second-order Taylor expansion of the log-partition at the isotropic-attention point yields a curvature equal to the running key covariance, which can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. Because it modifies only the read step, CCQ composes with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity (from 4K to 20K), and LongBench accuracy, at small extra cost.
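
The second-order expansion mentioned above can be written out as a short worked derivation. This is a sketch under stated assumptions, not the paper's notation: it assumes unscaled scores q^T k_i over n stored keys, and it assumes the "isotropic-attention point" is q = 0, where the softmax weights are uniform. The symbols Z, mu, and Sigma are introduced here for illustration.

```latex
% Log-partition of softmax attention over n stored keys (assumed unscaled scores)
Z(q) = \log \sum_{i=1}^{n} \exp\!\left(q^\top k_i\right)

% At q = 0 the softmax weights are uniform (1/n), so the gradient is the key mean
% and the Hessian is the key covariance under those uniform weights
\nabla Z(0) = \mu = \frac{1}{n} \sum_{i=1}^{n} k_i,
\qquad
\nabla^2 Z(0) = \Sigma = \frac{1}{n} \sum_{i=1}^{n} (k_i - \mu)(k_i - \mu)^\top

% Local quadratic model: the curvature term is the key covariance
Z(q) \approx \log n + \mu^\top q + \tfrac{1}{2}\, q^\top \Sigma\, q
```

Both mu and Sigma are averages of per-key terms (sums of k_i and of k_i k_i^T), which is why they can be accumulated recurrently in the same way as the linear-attention state.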

📝 Abstract

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.
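
The read-time contraction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not state the exact form of the contraction operator, so the damped inverse `(I + lam * Sigma)^{-1}` used below is an assumption chosen only because it shrinks the query most along high-variance key directions. The names `KeyStats`, `contract_query`, `read`, and the parameter `lam` are hypothetical, and the sketch omits gating, delta updates, and chunkwise computation.

```python
import numpy as np


class KeyStats:
    """Running first and second moments of the keys, updated one key at a time."""

    def __init__(self, dim):
        self.n = 0
        self.s1 = np.zeros(dim)          # running sum of keys
        self.s2 = np.zeros((dim, dim))   # running sum of outer products k k^T

    def update(self, k):
        self.n += 1
        self.s1 += k
        self.s2 += np.outer(k, k)

    def covariance(self):
        # Population covariance: E[k k^T] - E[k] E[k]^T
        mean = self.s1 / self.n
        return self.s2 / self.n - np.outer(mean, mean)


def contract_query(q, cov, lam=1.0):
    # ASSUMPTION: the operator (I + lam * cov)^{-1} is illustrative only;
    # the source does not specify the operator CCQ actually uses.
    return np.linalg.solve(np.eye(len(q)) + lam * cov, q)


def read(state, q):
    # Plain linear-attention read from the state S = sum_i k_i v_i^T.
    return state.T @ q


# Toy memory: keys have large variance along axis 0 and small variance along axis 3.
rng = np.random.default_rng(0)
keys = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 1.0, 0.3])
values = rng.normal(size=(500, 2))

stats = KeyStats(dim=4)
state = np.zeros((4, 2))
for k, v in zip(keys, values):
    stats.update(k)              # covariance statistics, same recurrence style as the state
    state += np.outer(k, v)      # additive fast-weight write

q = np.ones(4)
q_contracted = contract_query(q, stats.covariance())
output = read(state, q_contracted)
```

In this toy setup the contracted query keeps most of its component along the low-variance key direction and loses most of it along the high-variance one, which is the qualitative behaviour the abstract attributes to contracting "along the high-density directions of memory". Only the read path changes; the write to `state` is untouched.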

Problem

Research questions and friction points this paper is trying to address.

linear attention

in-context retrieval

long-context tasks

memory read

attention dilution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curvature-Conditioned Query

Linear Attention

Key Covariance