🤖 AI Summary
This paper identifies a systematic deficiency in large language models (LLMs) for locating short, critical information—termed “fine needles”—within long-context question answering: model accuracy degrades and positional bias intensifies as the length of the gold-context segment decreases.
Method: We conduct the first systematic investigation into how gold context size affects LLM retrieval capability, evaluating seven mainstream LLMs—spanning diverse scales and architectures—across general knowledge, biomedical, and mathematical reasoning domains using an enhanced needle-in-a-haystack benchmark.
Contribution/Results: Results consistently show that shorter gold contexts not only reduce overall accuracy but also exacerbate positional sensitivity, undermining assumptions underlying current long-context evaluation paradigms. This finding provides critical empirical evidence and methodological caution for designing AI agent systems requiring fine-grained, distributed information integration.
📝 Abstract
Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be retrieved from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply as the gold context shrinks: smaller gold contexts consistently degrade accuracy and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
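The paper's evaluation harness is not shown here, but the setup it describes—placing a gold context ("needle") of controllable length at a controllable depth inside irrelevant filler ("haystack")—can be sketched as follows. This is a minimal illustrative sketch, not the authors' code; the function and variable names are hypothetical.

```python
import random

def build_haystack_prompt(needle: str, filler_sentences: list[str],
                          haystack_len: int, needle_depth: float) -> str:
    """Assemble a needle-in-a-haystack prompt.

    The gold context `needle` is inserted among `haystack_len` irrelevant
    filler sentences at relative position `needle_depth` in [0, 1]
    (0.0 = start of the context, 1.0 = end).
    """
    haystack = random.sample(filler_sentences, k=haystack_len)
    pos = int(needle_depth * len(haystack))
    haystack.insert(pos, needle)
    return " ".join(haystack)

# Sweeping two axes probes the effects studied in the paper:
# shorten/lengthen `needle` to vary gold context size, and sweep
# `needle_depth` to measure positional bias at each size.
filler = [f"Filler fact number {i}." for i in range(100)]
short_needle = "The secret code is 4217."
prompt = build_haystack_prompt(short_needle, filler,
                               haystack_len=50, needle_depth=0.5)
```

In a full experiment, each (needle length, depth) cell would be sent to the model with a question targeting the needle, and accuracy would be aggregated per cell to reveal how shrinking the gold context degrades retrieval and sharpens positional sensitivity.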