🤖 AI Summary
Large language models (LLMs) struggle to extrapolate to contexts significantly longer than their training sequence length, due to the positional encoding and fixed attention span inherent in standard self-attention. This work proposes ReAttention, a training-free, position-agnostic long-context extension method. Its core idea is a top-$k$ attention preselection stage that decouples context selection from positional information while preserving the native position-aware attention, enabling in principle unbounded context-length extrapolation. With a Triton-optimized implementation, ReAttention incurs no additional inference overhead. Experiments show performance on par with traditional long-context methods on benchmarks such as LongBench. Notably, it extends the supported context of LLaMA3.1-8B to at least 1M tokens and that of LLaMA3.2-3B-Chat to 4M tokens, a 128× expansion, substantially relaxing existing context-length bottlenecks.
📝 Abstract
The long-context capability of Large Language Models (LLMs) has made significant breakthroughs, but the maximum supported context length in length extrapolation remains a critical bottleneck limiting their practical applications. The constraint on context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose ReAttention, a training-free approach enabling LLMs based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs position-agnostic top-$k$ attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we apply ReAttention to mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M tokens, and even expand the context length of LLaMA3.2-3B-Chat by 128$\times$ to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve efficient extrapolation without additional overhead. The code is available at https://github.com/OpenMOSS/ReAttention.
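The two-stage idea in the abstract, a position-agnostic top-$k$ preselection over the full cache followed by ordinary attention over only the selected entries, can be sketched for a single decoding step as below. This is a minimal illustration, not the paper's implementation: the function name `reattention_step` and all shapes are hypothetical, and the position-aware stage here is plain scaled softmax attention (the rotary positional encoding that a real model would apply within the finite attention scope is omitted for brevity).

```python
import numpy as np

def reattention_step(q, K, V, k_top):
    """Sketch of one ReAttention-style decoding step (hypothetical API).

    Stage 1: position-agnostic preselection -- score every cached key
    WITHOUT any positional encoding and keep the k_top highest-scoring
    entries, so selection works at any cache length.
    Stage 2: ordinary position-aware self-attention restricted to the
    selected entries, so the attention scope stays finite regardless of
    how long the full context is.
    """
    # Stage 1: raw dot-product scores, no positional information.
    scores = K @ q                       # shape: (n_cache,)
    idx = np.argsort(scores)[-k_top:]    # indices of the top-k keys
    idx.sort()                           # restore original cache order

    # Stage 2: standard scaled softmax attention over the subset.
    # (A real model would apply RoPE to q and K[idx] here, with
    # positions assigned within the finite attention scope.)
    sel = (K[idx] @ q) / np.sqrt(q.shape[-1])
    w = np.exp(sel - sel.max())          # numerically stable softmax
    w /= w.sum()
    return w @ V[idx], idx

# Example: a 256-entry KV cache, attending over only the top 32 entries.
rng = np.random.default_rng(0)
d, n, k_top = 16, 256, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out, idx = reattention_step(q, K, V, k_top)  # out: (16,), idx: 32 indices
```

Because the preselection uses no positional encoding, it never sees a position outside the pre-training range, which is why the attention scope can stay fixed while the cache, and hence the context, grows without bound (given sufficient memory).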