Dynamic Linear Attention

📅 2026-06-09

🤖 AI Summary

Existing multi-state linear attention mechanisms use fixed state-merging policies that cannot adapt to the changing importance of tokens, which irreversibly obscures critical tokens and causes errors to accumulate over long sequences. This work proposes Dynamic Linear Attention (DLA), a framework that places state boundaries adaptively based on token-level information variation and selectively merges adjacent low-information states. The approach keeps a fixed-size state cache while preserving the sub-quadratic cost of linear attention. Pre-trained on two linear attention models and evaluated on 16 datasets across three categories, DLA is reported to outperform state-of-the-art methods.

📝 Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory as multiple states. However, existing multi-state linear attention methods rely on fixed state-merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art methods.
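The abstract does not give the scoring function, the merge rule, or the read-out, so the following is only a minimal sketch of how the two components could fit together, not the authors' implementation. It assumes that each token arrives with a caller-supplied scalar information score, that a new state opens whenever that score reaches a threshold, that each state is the usual linear attention sum of outer products k v^T, and that when the cache overflows the adjacent pair with the lowest combined score is merged. The names `DynamicStateCache`, `State`, `boundary_threshold`, and `info` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def outer(k: Vector, v: Vector) -> Matrix:
    """Outer product k v^T, the per-token update of a linear attention state."""
    return [[ki * vj for vj in v] for ki in k]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


@dataclass
class State:
    S: Matrix    # sum of k v^T over the tokens in this segment
    info: float  # accumulated information score (hypothetical measure)
    start: int   # first token position covered
    end: int     # last token position covered


class DynamicStateCache:
    """Hypothetical fixed-capacity, chronologically ordered state cache."""

    def __init__(self, capacity: int, boundary_threshold: float):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.boundary_threshold = boundary_threshold
        self.states: List[State] = []

    def append(self, k: Vector, v: Vector, info: float, pos: int) -> None:
        # Assumption: a high score marks a transition, so the token opens a
        # new state; a low score folds the token into the current state.
        if not self.states or info >= self.boundary_threshold:
            self.states.append(State(outer(k, v), info, pos, pos))
            if len(self.states) > self.capacity:
                self._merge_lowest_pair()
        else:
            last = self.states[-1]
            last.S = add(last.S, outer(k, v))
            last.info += info
            last.end = pos

    def _merge_lowest_pair(self) -> None:
        # Merge the adjacent pair with the smallest combined score, which
        # keeps the cache bounded and preserves chronological order.
        i = min(
            range(len(self.states) - 1),
            key=lambda j: self.states[j].info + self.states[j + 1].info,
        )
        a, b = self.states[i], self.states[i + 1]
        self.states[i:i + 2] = [State(add(a.S, b.S), a.info + b.info, a.start, b.end)]

    def read(self, q: Vector, weights: Optional[List[float]] = None) -> Vector:
        # Unnormalized read-out: sum over states of w * (S^T q). Uniform
        # weights are a placeholder; a real model would weight each state.
        if weights is None:
            weights = [1.0] * len(self.states)
        out = [0.0] * (len(self.states[0].S[0]) if self.states else 0)
        for w, st in zip(weights, self.states):
            for qi, row in zip(q, st.S):
                for j, s in enumerate(row):
                    out[j] += w * qi * s
        return out
```

With uniform weights the read-out equals that of a single-state linear attention, so any benefit of keeping several states in this sketch would have to come from per-state weighting, which is left to the caller here. Each `append` costs time proportional to the state size plus the cache capacity, independent of sequence length.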

Problem

Research questions and friction points this paper is trying to address.

linear attention

multi-state memory

dynamic token importance

error accumulation

long-context modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Linear Attention

Multi-state Memory

Information-Aware Merging