🤖 AI Summary
Transformer-based large language models suffer from attention miscalibration and sharp performance degradation when processing ultra-long inputs, primarily due to out-of-distribution (O.O.D.) positional encodings. Existing training-free length extrapolation methods face critical limitations, including attention logit outliers, loss of fine-grained positional information, and computational inefficiency. To address these issues, we propose GALI, a fine-tuning-free length extrapolation method. GALI performs greedy attention logit interpolation: it adaptively selects narrow subintervals within the pretrained positional range for interpolation, avoiding logit outliers while preserving high-fidelity positional modeling. We further identify a non-uniform positional modeling pattern in Transformers and establish a new paradigm: extrapolating within a narrower positional interval yields superior performance. GALI consistently outperforms existing training-free methods across diverse long-context tasks, improving both stability and accuracy when reasoning beyond the training context length. The implementation is publicly available.
📝 Abstract
Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window: performance degrades due to positional out-of-distribution (O.O.D.) values that disrupt attention computations. Existing solutions, whether fine-tuning-based or training-free, are limited by computational inefficiency, attention logit outliers, or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the use of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. Our results demonstrate that GALI consistently outperforms state-of-the-art training-free methods. Our findings also reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results, even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/AcademyCityL/GALI.
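To make the core idea concrete, the following toy sketch illustrates attention logit interpolation: an out-of-range relative position is mapped into the pretrained range, and its logit is approximated by interpolating between logits computed at the two neighboring in-range *integer* positions, so the positional encoding never sees an O.O.D. position. This is a heavily simplified illustration (single-frequency rotary encoding, linear position mapping, a hypothetical `TRAIN_LEN`), not GALI's actual greedy interval selection; see the repository above for the real implementation.

```python
import math
import numpy as np

TRAIN_LEN = 8  # hypothetical pretrained context window (toy size)

def rope_rotate(x, pos):
    # Rotate a 2-D vector by an angle proportional to the position
    # (single-frequency rotary encoding; real RoPE uses many frequencies).
    c, s = math.cos(pos), math.sin(pos)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def logit_at(q, k, rel_pos):
    # Attention logit with the rotary encoding applied at position rel_pos.
    return float(rope_rotate(q, rel_pos) @ k)

def interpolated_logit(q, k, rel_pos, seq_len):
    # For sequences longer than the pretrained window, map the relative
    # position linearly into [0, TRAIN_LEN - 1], then interpolate between
    # the logits at the two neighboring in-range integer positions instead
    # of evaluating the encoding at a fractional or out-of-range position.
    if seq_len <= TRAIN_LEN:
        return logit_at(q, k, rel_pos)
    mapped = rel_pos * (TRAIN_LEN - 1) / (seq_len - 1)
    lo = int(math.floor(mapped))
    hi = min(lo + 1, TRAIN_LEN - 1)
    frac = mapped - lo
    return (1 - frac) * logit_at(q, k, lo) + frac * logit_at(q, k, hi)

if __name__ == "__main__":
    q = np.array([1.0, 0.0])
    k = np.array([0.0, 1.0])
    # Position 20 in a length-32 sequence is far outside the toy window of 8,
    # yet every logit evaluation stays within the pretrained positional range.
    print(interpolated_logit(q, k, rel_pos=20, seq_len=32))
```

Interpolating the logits themselves (rather than feeding interpolated positions into the encoding) is what distinguishes this family of methods from position-interpolation schemes, which lose local positional resolution by compressing all positions uniformly.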