Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Language models face dual challenges in long-context modeling: quadratic computational complexity and poor length generalization. This paper proposes a novel architecture based on block-sparse attention, systematically identifying and validating three core design principles: (1) a nonlinear chunk encoder with a [CLS] token to produce retrievable global representations; (2) a bypass residual path enabling stable fusion of local and global information; and (3) explicit sparsity constraints during pretraining to mitigate train-inference distribution shift. Through ablation studies, theoretical analysis, and comparisons with sliding-window and state-space models—all within a unified framework—the functional roles of each component are rigorously validated. Remarkably, the method achieves zero-shot extrapolation from a 4K-token training context to 32M tokens without fine-tuning—the first such result—and attains state-of-the-art performance on the RULER and BABILong benchmarks.
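The first principle, a nonlinear chunk encoder with a [CLS] token, can be illustrated with a minimal sketch: a learned [CLS]-style query attention-pools each chunk's token embeddings into a single "landmark" vector that later serves as the chunk's retrievable global representation. This is a simplified NumPy illustration, not the paper's implementation; `encode_chunks`, `cls_query`, and the single-head pooling are hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_chunks(tokens, cls_query, chunk_size):
    """Attention-pool each chunk with a [CLS]-style query to produce one
    landmark vector per chunk (hypothetical single-head simplification;
    the paper's encoder is a nonlinear multi-layer module)."""
    n, d = tokens.shape
    assert n % chunk_size == 0, "sequence length must divide into chunks"
    chunks = tokens.reshape(n // chunk_size, chunk_size, d)
    # the [CLS] query attends only over the tokens of its own chunk
    scores = chunks @ cls_query / np.sqrt(d)          # (num_chunks, chunk_size)
    weights = softmax(scores, axis=-1)
    landmarks = (weights[..., None] * chunks).sum(axis=1)  # (num_chunks, d)
    return landmarks
```

Because each landmark depends only on intra-chunk tokens, the representation is position-independent across chunks, which is what allows retrieval to extrapolate far beyond the training context length.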

📝 Abstract
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
Problem

Research questions and friction points this paper is trying to address.

Improving length generalization in hierarchical sparse attention models
Identifying key architectural principles for effective long-context processing
Bridging train-test distribution gap through specific design components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonlinear Chunk Encoder with a dedicated [CLS] token to produce retrievable chunk representations
Bypassing Residual Path for stable integration of retrieved global information
Enforced selection sparsity during pre-training to close the train-test distribution gap
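The second and third principles can be sketched together: at inference, a query selects only the top-k chunks by landmark similarity (enforced selection sparsity), and the retrieved global read-out is added to the local output through a separate residual path so it is not overridden by the local stream. This is a hedged toy sketch; `sparse_retrieve_and_fuse` and the unit gate are hypothetical, and real models learn the fusion and attend over the selected chunks' tokens rather than pooled values.

```python
import numpy as np

def sparse_retrieve_and_fuse(query, landmarks, chunk_values, local_out, k=2):
    """Select the top-k chunks by landmark similarity, softmax-weight their
    pooled values, and fuse the global read-out via a bypassing residual
    path (toy sketch; the selection is hard to enforce sparsity)."""
    d = query.shape[-1]
    sims = landmarks @ query / np.sqrt(d)     # similarity to every landmark
    topk = np.argsort(sims)[-k:]              # enforced selection sparsity
    w = np.exp(sims[topk] - sims[topk].max())
    w /= w.sum()                              # softmax over selected chunks only
    global_read = (w[:, None] * chunk_values[topk]).sum(axis=0)
    # bypass residual: global information enters additively, outside the
    # local residual stream, so it cannot be overwritten by local updates
    return local_out + global_read
```

Enforcing the same top-k sparsity during pre-training means the model never sees dense chunk mixing it would have to unlearn at long test lengths, which is the distribution-shift argument in the summary above.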