Blurry Window Attention

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in long-context Transformers: the quadratic computational complexity of attention and the unbounded growth of the KV cache. It also targets the weaker performance of existing linear attention models on retrieval and recall tasks. The authors propose a bounded-memory attention mechanism inspired by state space models, which reconstructs the historical KV states via Dirichlet kernel interpolation over a stored frequency window. This approach generalizes the sliding window into a blurry window with tunable resolution and can also be viewed as a special case of gated slot attention. While keeping linear computational complexity, the method improves recall: on the MQAR task its state efficiency is 8× better than that of the sliding-window baseline, and on RegBench only this method and sliding-window attention improve as the state size grows among the linear models tested.
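To make the bounded-state idea concrete, here is a minimal sketch of the sliding-window baseline the summary refers to, not of the proposed method itself. The class name `SlidingWindowCache`, the helper `run`, and all sizes are hypothetical choices for illustration; the only point is that the cache stops growing once it reaches the window length.

```python
from collections import deque

import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


class SlidingWindowCache:
    """Hypothetical KV cache that keeps only the last `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def step(self, q, k, v):
        # Append the new key/value, then attend over what is still cached.
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)      # (cached_tokens, dim)
        V = np.stack(self.values)    # (cached_tokens, dim)
        scores = K @ q / np.sqrt(q.shape[0])
        return softmax(scores) @ V   # (dim,)


def run(window, steps=16, dim=4, seed=0):
    # Feed random tokens and record the cache size after each step.
    rng = np.random.default_rng(seed)
    cache = SlidingWindowCache(window)
    sizes, out = [], None
    for _ in range(steps):
        q, k, v = rng.normal(size=(3, dim))
        out = cache.step(q, k, v)
        sizes.append(len(cache.keys))
    return sizes, out
```

Calling `run(window=4)` yields a cache size that grows for the first four steps and then stays at four, whereas setting `window` equal to `steps` mimics a full KV cache that grows with every token.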

📝 Abstract

The Softmax Attention operation in Transformer language models has quadratic complexity in the sequence length and a state, the KV cache, that grows with the sequence, which becomes a bottleneck in long-context scenarios. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as State-Space Models (SSMs), Linear Attention (LA), and Attention with Bounded-memory Control (ABC). Though linear models achieve language perplexity similar to that of Transformers, they still lag behind in tasks that require retrieval or recall of specific information. In this work, we introduce Blurry Window Attention (BLA), a novel ABC method inspired by SSMs. BLA stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels. BLA can be understood as a generalization of Sliding Window Attention (SWA), depending on the resolution of the Dirichlet kernels, or as a special case of Gated Slot Attention (GSA) in which the decay factor is implemented with Dirichlet kernels. We describe in detail the theory and efficient implementation of BLA. On the Multi-Query Associative Recall (MQAR) synthetic task, we show that the state efficiency of BLA is 8× better than that of SWA and is competitive with popular linear attention models. On the RegBench synthetic task, BLA and SWA are the only linear models we tested whose performance improves as the state size grows.
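The phrase "reconstructed via interpolation using Dirichlet kernels" can be illustrated with a standard signal-processing fact: keeping only a window of low frequencies of a periodic sequence and inverting the transform is the same as circularly convolving the sequence with a Dirichlet kernel. The sketch below shows only that general identity on a 1-D signal. It is not the BLA implementation, and the function names, the odd sequence length, and the 1-D setting are all assumptions made for illustration; how BLA applies the kernels to keys and values is not specified in this abstract.

```python
import numpy as np


def dirichlet_kernel(n_points, n_freq):
    # Periodic Dirichlet kernel: D[t] = (1/N) * sum_{k=-n_freq}^{n_freq} exp(2*pi*i*k*t/N).
    t = np.arange(n_points)
    k = np.arange(-n_freq, n_freq + 1)
    return np.exp(2j * np.pi * np.outer(t, k) / n_points).sum(axis=1).real / n_points


def blurry_reconstruct(x, n_freq):
    # Keep only the frequency window |k| <= n_freq, then invert the DFT.
    n = len(x)
    spectrum = np.fft.fft(x)
    k = np.arange(n)
    signed_freq = np.where(k <= n // 2, k, k - n)  # integer frequency of each DFT bin
    spectrum[np.abs(signed_freq) > n_freq] = 0
    return np.fft.ifft(spectrum).real


def kernel_interpolate(x, n_freq):
    # Same reconstruction, written as circular convolution with the Dirichlet kernel.
    n = len(x)
    d = dirichlet_kernel(n, n_freq)
    shift = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return d[shift] @ x


# Hypothetical toy signal; the odd length avoids the Nyquist-bin special case.
rng = np.random.default_rng(0)
signal = rng.normal(size=33)
low_res = blurry_reconstruct(signal, n_freq=4)    # 9 stored coefficients, blurry history
full_res = blurry_reconstruct(signal, n_freq=16)  # all 33 coefficients, exact history
```

With a narrow frequency window the reconstruction is a smoothed ("blurry") version of the signal, while at full resolution the kernel collapses to a unit impulse and the history is recovered exactly, which is the sense in which kernel resolution trades state size against sharpness in this toy setting.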

Problem

Research questions and friction points this paper is trying to address.

Softmax Attention

quadratic complexity

KV cache

long context

information retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Blurry Window Attention

Linear Attention

State-Space Models