ReAttention: Training-Free Infinite Context with Finite Attention Scope

📅 2024-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to extrapolate to contexts significantly longer than their training sequence length, due to the positional encoding and fixed attention span inherent in standard self-attention. This work proposes ReAttention, a training-free, fine-tuning-free, position-agnostic long-context extension method. Its core innovation is a top-k attention preselection step that scores the cache without positional information while preserving the model's native position-aware attention, enabling theoretically unbounded context-length extrapolation. Implemented with Triton-optimized inference kernels, ReAttention incurs no additional computational overhead. Extensive experiments demonstrate performance on par with state-of-the-art methods on benchmarks such as LongBench. Notably, it extends LLaMA3.1-8B's supported context to at least 1M tokens and expands LLaMA3.2-3B-Chat to 4M tokens, a 128× expansion, substantially overcoming existing context-length bottlenecks.

📝 Abstract
The long-context capability of Large Language Models (LLMs) has made significant breakthroughs, but the maximum supported context length in length extrapolation remains a critical bottleneck limiting their practical applications. The constraint on context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach enabling LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs position-agnostic top-$k$ attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we also apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M tokens and even expanding the context length of LLaMA3.2-3B-chat by 128× to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve efficient extrapolation without additional overhead. The code is available at https://github.com/OpenMOSS/ReAttention.
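The two-stage idea in the abstract can be sketched for a single decoding step: first score the whole KV cache without positional encoding and keep only the top-$k$ entries, then run ordinary softmax attention over that finite selection. The following minimal NumPy sketch is illustrative only; the function name, the re-indexing of selected keys, and the omission of RoPE re-application are assumptions, not the paper's Triton implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reattention_step(q, K, V, k_top):
    """One decoding step for a single query q over a KV cache (K, V).

    Stage 1 is position-agnostic: keys are scored with no positional
    encoding, and only the top-k entries survive. Stage 2 is ordinary
    position-aware attention within the finite scope; here the selected
    keys are simply kept in cache order (a real implementation would
    re-apply rotary positions inside the finite attention scope).
    """
    d = q.shape[-1]
    # Stage 1: position-agnostic top-k preselection over the full cache.
    scores = K @ q / np.sqrt(d)                              # (T,)
    idx = np.sort(np.argpartition(scores, -k_top)[-k_top:])  # keep cache order
    K_sel, V_sel = K[idx], V[idx]
    # Stage 2: standard softmax attention over the finite selection.
    w = softmax(K_sel @ q / np.sqrt(d))
    return w @ V_sel, idx
```

Because stage 1 never consults positions, the cache can grow arbitrarily long while stage 2 always attends over a fixed-size window, which is what lets the finite attention scope cover an effectively unbounded context.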
Problem

Research questions and friction points this paper is trying to address.

Overcomes the context-length limitation of LLMs
Enables infinite context with a finite attention scope
Extends context length without any further training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free infinite context with finite attention scope
Position-agnostic top-k preselection before position-aware self-attention
Triton-optimized inference with no additional overhead
Xiaoran Liu
Fudan University
natural language processing
Qipeng Guo
Fudan University
Yuerong Song
School of Computer Science, Fudan University
Zhigeng Liu
School of Computer Science, Fudan University
Kai Lv
School of Computer Science, Fudan University
Hang Yan
Shanghai AI Laboratory
Linlin Li
Huawei Noah’s Ark Lab
Qun Liu
Huawei Noah’s Ark Lab
Xipeng Qiu
School of Computer Science, Fudan University, Shanghai AI Laboratory