🤖 AI Summary
Existing NIAH benchmarks inflate estimates of LLMs’ long-context comprehension by embedding irrelevant information, thereby obscuring true reasoning capabilities over purely relevant content.
Method: We propose NeedleChain, the first benchmark featuring *fully query-relevant, chain-structured logical dependencies* in long contexts, enabling rigorous evaluation of holistic reasoning over semantically coherent, variable-length sentence chains. To mitigate positional-encoding degradation in ultra-long sequences, we introduce ROPE Contraction, a strategy that preserves relative position information while improving long-range dependency modeling.
Contribution/Results: Experiments reveal substantial comprehension gaps—even in state-of-the-art models like GPT-4o—when processing fully relevant long texts, exposing critical biases in current evaluation paradigms. Our work identifies fundamental bottlenecks in LLMs’ genuine long-text reasoning and provides an open-source benchmark (code and data) to enable trustworthy, fine-grained assessment of long-context understanding.
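The ROPE Contraction idea can be pictured as shrinking position indices so that a long sequence falls back inside the positional range the model saw during training. The sketch below is a minimal, hypothetical illustration in that spirit (analogous to position-interpolation-style scaling); the paper's exact recipe may differ.

```python
import math

def rope_angles(position, dim, base=10000.0, contraction=1.0):
    # RoPE rotation angles for one token position. Dividing the position
    # index by `contraction` squeezes long sequences into the positional
    # range seen during training. Hypothetical sketch only: the paper's
    # actual ROPE Contraction strategy may work differently.
    pos = position / contraction
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

# With contraction=4, position 8 is rotated exactly like position 2
# of an uncontracted model.
```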
📝 Abstract
The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to faithfully incorporate a given context made up of only ten sentences, all of them query-relevant. In response, we introduce a novel benchmark, **NeedleChain**, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
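To make the benchmark idea concrete, a NeedleChain-style item can be sketched as a chain in which every sentence is query-relevant and each link depends on the previous one, so the answer cannot be retrieved from any single "needle". The generator below is a hypothetical illustration (the names `Person_i`, the coin-counting template, and the helper itself are invented here; the real dataset's format may differ).

```python
import random

def make_needlechain(num_links=10, seed=0):
    """Hypothetical sketch of a chain-structured, fully query-relevant
    context: answering the query requires composing every sentence in
    order, not locating one needle among distractors."""
    rng = random.Random(seed)
    value = rng.randint(1, 100)
    sentences = [f"Person_0 owns {value} coins."]
    for i in range(1, num_links):
        delta = rng.randint(1, 10)
        value += delta
        sentences.append(f"Person_{i} owns {delta} more coins than Person_{i-1}.")
    query = f"How many coins does Person_{num_links - 1} own?"
    return " ".join(sentences), query, value

context, query, answer = make_needlechain(num_links=5)
```

Because each sentence only relates neighbouring links, dropping or misreading any one of them changes the answer, which is what distinguishes this setup from classic NIAH retrieval.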