🤖 AI Summary
Existing NIAH benchmarks inflate estimates of LLMs’ long-context comprehension by embedding irrelevant information, thereby obscuring true reasoning capabilities over purely relevant content.
Method: We propose NeedleChain, the first benchmark featuring *fully query-relevant, chain-structured logical dependencies* in long contexts, enabling rigorous evaluation of holistic reasoning over semantically coherent, variable-length sentence chains. To mitigate positional-encoding degradation in ultra-long sequences, we introduce ROPE Contraction, a strategy that preserves relative position information while improving long-range dependency modeling.
Contribution/Results: Experiments reveal substantial comprehension gaps—even in state-of-the-art models like GPT-4o—when processing fully relevant long texts, exposing critical biases in current evaluation paradigms. Our work identifies fundamental bottlenecks in LLMs’ genuine long-text reasoning and provides an open-source benchmark (code and data) to enable trustworthy, fine-grained assessment of long-context understanding.
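The ROPE Contraction idea can be pictured as shrinking position indices so that a long sequence falls back inside the positional range the model saw during training. The sketch below is a minimal, hypothetical illustration in that spirit (analogous to position-interpolation-style scaling); the paper's exact recipe may differ.

```python
import math

def rope_angles(position, dim, base=10000.0, contraction=1.0):
    # RoPE rotation angles for one token position. Dividing the position
    # index by `contraction` squeezes long sequences into the positional
    # range seen during training. Hypothetical sketch only: the paper's
    # actual ROPE Contraction strategy may work differently.
    pos = position / contraction
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

# With contraction=4, position 8 is rotated exactly like position 2
# of an uncontracted model.
```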
📝 Abstract
The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to faithfully incorporate a given context made up of only ten sentences, all of them query-relevant. In response, we introduce a novel benchmark, **NeedleChain**, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
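To make the benchmark idea concrete, a NeedleChain-style item can be sketched as a chain in which every sentence is query-relevant and each link depends on the previous one, so the answer cannot be retrieved from any single "needle". The generator below is a hypothetical illustration (the names `Person_i`, the coin-counting template, and the helper itself are invented here; the real dataset's format may differ).

```python
import random

def make_needlechain(num_links=10, seed=0):
    """Hypothetical sketch of a chain-structured, fully query-relevant
    context: answering the query requires composing every sentence in
    order, not locating one needle among distractors."""
    rng = random.Random(seed)
    value = rng.randint(1, 100)
    sentences = [f"Person_0 owns {value} coins."]
    for i in range(1, num_links):
        delta = rng.randint(1, 10)
        value += delta
        sentences.append(f"Person_{i} owns {delta} more coins than Person_{i-1}.")
    query = f"How many coins does Person_{num_links - 1} own?"
    return " ".join(sentences), query, value

context, query, answer = make_needlechain(num_links=5)
```

Because each sentence only relates neighbouring links, dropping or misreading any one of them changes the answer, which is what distinguishes this setup from classic NIAH retrieval.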