NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing NIAH benchmarks inflate estimates of LLMs' long-context comprehension by embedding the relevant needle within irrelevant filler, obscuring true reasoning over purely relevant content. Method: We propose NeedleChain, a benchmark whose long contexts consist of *fully query-relevant, chain-structured logical dependencies*, enabling rigorous evaluation of holistic reasoning over semantically coherent, variable-length sentence chains. To mitigate positional-encoding degradation in ultra-long sequences, we introduce ROPE Contraction, a simple strategy that preserves relative position information while improving long-range dependency modeling. Contribution/Results: Experiments reveal substantial comprehension gaps, even in state-of-the-art models such as GPT-4o, when processing fully relevant long texts, exposing biases in current evaluation paradigms. Our work identifies fundamental bottlenecks in LLMs' genuine long-text reasoning and provides an open-source benchmark (code and data) to enable trustworthy, fine-grained assessment of long-context understanding.
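The summary does not spell out how the position indices are contracted; a common way to realize this kind of idea is to scale position indices by a factor below 1 before computing the rotary angles, so that a long sequence occupies a shorter effective position range. The sketch below is a minimal, hypothetical illustration in that spirit (the `contraction` parameter and its placement are assumptions, not the paper's exact recipe); it also checks that attention scores still depend only on relative offsets, so the relative-position property of RoPE is preserved.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0, contraction=1.0):
    """Apply rotary position embedding (RoPE) to vectors x.

    `contraction` < 1 shrinks the effective position indices, a
    hypothetical illustration of position contraction; the paper's
    exact ROPE Contraction formulation may differ.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates pairs of dimensions"
    # Per-pair frequencies, as in the original RoPE formulation.
    freqs = base ** (-np.arange(0, d, 2) / d)          # shape (d/2,)
    angles = np.outer(positions * contraction, freqs)  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The q.k score depends only on the relative offset (m - n), and this
# property survives contraction: offsets 100-90 and 20-10 give equal scores.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
s1 = rope_rotate(q, np.array([100]), contraction=0.5) @ rope_rotate(k, np.array([90]), contraction=0.5).T
s2 = rope_rotate(q, np.array([20]), contraction=0.5) @ rope_rotate(k, np.array([10]), contraction=0.5).T
```

Because each rotary pair is rotated by an angle linear in position, contracting the indices uniformly rescales all relative offsets without breaking their relative structure, which is the property the summary attributes to the method.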

📝 Abstract
The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate given contexts made up of only ten query-relevant sentences. In response, we introduce a novel benchmark, **NeedleChain**, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' true long-context reasoning capability
Assessing LLM performance with fully relevant contexts
Improving LC understanding via ROPE Contraction strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces NeedleChain benchmark for LLM evaluation
Uses ROPE Contraction to improve understanding
Flexible context length and reasoning order
Hyeonseok Moon
Korea University
Neural Machine Translation, Natural Language Processing
Heuiseok Lim
Korea University, Seoul, Republic of Korea