A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

📅 2025-02-04
🤖 AI Summary
Transformer-based large language models suffer from attention miscalibration and sharp performance degradation when processing ultra-long inputs, primarily due to out-of-distribution (O.O.D.) positional encodings. Existing training-free length extrapolation methods face critical limitations, including logit anomalies, loss of fine-grained positional information, and computational inefficiency. To address these issues, we propose GALI, a fine-tuning-free length extrapolation method. GALI introduces greedy attention logit interpolation: it adaptively selects narrow subintervals within the pretrained positional range for interpolation, thereby avoiding logit anomalies while preserving high-fidelity positional modeling. Furthermore, we empirically identify a non-uniform positional modeling pattern in Transformers and establish a new guideline: narrow-range extrapolation yields superior performance. GALI consistently outperforms existing training-free methods across diverse long-context tasks, significantly enhancing both stability and accuracy in beyond-context-length reasoning. The implementation is publicly available.

📝 Abstract
Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window, with performance degrading due to positional out-of-distribution (O.O.D.) issues that disrupt attention computations. Existing solutions, both fine-tuning and training-free methods, are limited by computational inefficiency, attention logit outliers, or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the utilization of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. Our results demonstrate that GALI consistently outperforms state-of-the-art training-free methods. Our findings reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results, even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/AcademyCityL/GALI.
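The abstract's core idea, keeping attention logits within the pretrained positional range by interpolating rather than extrapolating, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy rotary-style scoring function, the uniform position rescaling, and the names `attention_logit` and `interpolated_logit` are all assumptions for illustration, whereas GALI's actual greedy, per-token interval selection is more involved.

```python
import numpy as np

def attention_logit(q, k, rel_pos, theta=10000.0):
    """Toy rotary-style attention logit: rotate q by angles that
    depend on the relative position, then dot with k.
    A stand-in for full RoPE, for illustration only."""
    d = q.shape[-1]
    freqs = 1.0 / (theta ** (np.arange(0, d, 2) / d))
    ang = rel_pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    q2 = q.reshape(-1, 2)
    q_rot = np.empty_like(q2)
    q_rot[:, 0] = q2[:, 0] * cos - q2[:, 1] * sin
    q_rot[:, 1] = q2[:, 0] * sin + q2[:, 1] * cos
    return float(q_rot.reshape(-1) @ k)

def interpolated_logit(q, k, scaled_pos):
    """Linearly interpolate between the logits computed at the two
    nearest integer (in-distribution) relative positions, so no
    logit is ever computed at an unseen positional distance."""
    lo, hi = int(np.floor(scaled_pos)), int(np.ceil(scaled_pos))
    if lo == hi:
        return attention_logit(q, k, lo)
    w = scaled_pos - lo
    return (1 - w) * attention_logit(q, k, lo) + w * attention_logit(q, k, hi)

# Map positions from a longer target context into the (narrower)
# pretrained positional interval before scoring.
train_len, target_len = 16, 64
scale = (train_len - 1) / (target_len - 1)
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
logit = interpolated_logit(q, k, scaled_pos=40 * scale)
```

The key contrast with position-interpolation methods is what gets interpolated: here the interpolation happens on the attention logits themselves, after scoring at two in-distribution positions, rather than on the position IDs before scoring, which is what lets this style of method avoid logit outliers at unseen positional distances.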
Problem

Research questions and friction points this paper is trying to address.

Addresses positional out-of-distribution in LLMs
Proposes training-free length extrapolation method
Enhances long-text understanding in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free length extrapolation
Greedy Attention Logit Interpolation
Avoids attention logit outliers
Yan Li
School of Computer Science, The University of Sydney, Sydney, Australia
Tianyi Zhang
School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Zechuan Li
Hunan University
Point Cloud, Deep Learning, 3D Object Detection
Soyeon Caren Han
University of Melbourne, University of Sydney, POSTECH
Natural Language Processing, Multimodal Learning, Vision and Language, Natural Language Understanding