🤖 AI Summary
Large language models (LLMs) struggle to extrapolate to contexts significantly longer than their training sequence length, due to the positional encoding and fixed attention span inherent in standard self-attention. This work proposes ReAttention, a training-free, position-agnostic long-context extension method. Its core idea is a top-$k$ attention preselection stage that decouples context selection from positional information while preserving the native position-aware attention, enabling in principle unbounded context-length extrapolation. With a Triton-optimized implementation, ReAttention incurs no additional inference overhead. Experiments show performance on par with traditional long-context methods on benchmarks such as LongBench. Notably, it extends the supported context of LLaMA3.1-8B to at least 1M tokens and that of LLaMA3.2-3B-Chat to 4M tokens, a 128× expansion, substantially relaxing existing context-length bottlenecks.
📝 Abstract
The long-context capability of Large Language Models (LLMs) has made significant breakthroughs, but the maximum supported context length in length extrapolation remains a critical bottleneck limiting their practical applications. The constraint on context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach enabling LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs position-agnostic top-$k$ attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M tokens, and even expand the context length of LLaMA3.2-3B-Chat by 128$\times$ to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve efficient extrapolation without additional overhead. The code is available at https://github.com/OpenMOSS/ReAttention.
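The two-stage idea in the abstract, a position-agnostic top-$k$ preselection over the full cache followed by ordinary attention over only the selected entries, can be sketched for a single decoding step as below. This is a minimal illustration, not the paper's implementation: the function name `reattention_step` and all shapes are hypothetical, and the position-aware stage here is plain scaled softmax attention (the rotary positional encoding that a real model would apply within the finite attention scope is omitted for brevity).

```python
import numpy as np

def reattention_step(q, K, V, k_top):
    """Sketch of one ReAttention-style decoding step (hypothetical API).

    Stage 1: position-agnostic preselection -- score every cached key
    WITHOUT any positional encoding and keep the k_top highest-scoring
    entries, so selection works at any cache length.
    Stage 2: ordinary position-aware self-attention restricted to the
    selected entries, so the attention scope stays finite regardless of
    how long the full context is.
    """
    # Stage 1: raw dot-product scores, no positional information.
    scores = K @ q                       # shape: (n_cache,)
    idx = np.argsort(scores)[-k_top:]    # indices of the top-k keys
    idx.sort()                           # restore original cache order

    # Stage 2: standard scaled softmax attention over the subset.
    # (A real model would apply RoPE to q and K[idx] here, with
    # positions assigned within the finite attention scope.)
    sel = (K[idx] @ q) / np.sqrt(q.shape[-1])
    w = np.exp(sel - sel.max())          # numerically stable softmax
    w /= w.sum()
    return w @ V[idx], idx

# Example: a 256-entry KV cache, attending over only the top 32 entries.
rng = np.random.default_rng(0)
d, n, k_top = 16, 256, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out, idx = reattention_step(q, K, V, k_top)  # out: (16,), idx: 32 indices
```

Because the preselection uses no positional encoding, it never sees a position outside the pre-training range, which is why the attention scope can stay fixed while the cache, and hence the context, grows without bound (given sufficient memory).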