Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a systematic deficiency in large language models (LLMs) for locating short, critical information—termed “fine needles”—within long-context question answering: model accuracy degrades and positional bias intensifies as the length of the gold-context segment decreases. Method: The paper conducts the first systematic investigation into how gold-context size affects LLM retrieval capability, evaluating seven mainstream LLMs—spanning diverse scales and architectures—across general knowledge, biomedical, and mathematical reasoning domains using an enhanced needle-in-a-haystack benchmark. Contribution/Results: Results consistently show that shorter gold contexts not only reduce overall accuracy but also exacerbate positional sensitivity, undermining assumptions underlying current long-context evaluation paradigms. This finding provides critical empirical evidence and methodological caution for designing AI agent systems that require fine-grained, distributed information integration.

📝 Abstract
Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with small relevant contexts in large irrelevant data
Gold context length variation impacts LLM question answering performance
Smaller gold contexts degrade performance and increase positional sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studies gold context length impact on LLMs
Reveals performance drop with shorter contexts
Tests seven LLMs across diverse domains
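The core experimental manipulation described above—placing a gold-context segment of varying length at varying positions among distractor passages—can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, prompt layout, and example passages are assumptions for demonstration.

```python
# Illustrative sketch (not the paper's implementation): building a
# needle-in-a-haystack prompt where the gold segment's length and
# position are the experimental variables.

def build_haystack_prompt(gold_context: str, distractors: list[str],
                          position: int, question: str) -> str:
    """Insert the gold segment among distractor passages at the given
    index, then append the question."""
    passages = distractors[:position] + [gold_context] + distractors[position:]
    context = "\n\n".join(passages)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Two gold segments carrying the same fact at different lengths,
# mirroring the paper's short-vs-long gold-context comparison:
short_gold = "Ada Lovelace wrote the first algorithm."
long_gold = ("Ada Lovelace, working with Charles Babbage on the "
             "Analytical Engine in the 1840s, wrote what is widely "
             "regarded as the first algorithm intended for a machine.")
distractors = [f"Filler passage {i} about an unrelated topic." for i in range(5)]

prompt = build_haystack_prompt(short_gold, distractors, position=2,
                               question="Who wrote the first algorithm?")
```

Sweeping `position` across the haystack while holding the gold fact constant is what exposes the positional sensitivity the paper reports; repeating the sweep with `short_gold` versus `long_gold` isolates the gold-context-size effect.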
Owen Bianchi
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Mathew J. Koretsky
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Maya Willey
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Chelsea X. Alvarado
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Tanay Nayak
DataTecnica LLC; Johns Hopkins University
Nicole Kuznetsov
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Mike A. Nalls
Founder/consultant with Data Tecnica International; data science lead at NIH’s Center for Alzheimer’s Disease and Related Dementias
Statistical genetics, neurodegeneration, data science, biostatistics, genomics
Faraz Faghri
National Institutes of Health
Computer science, neuroscience, health, aging, complex diseases
Daniel Khashabi
Johns Hopkins University
Natural language processing, artificial intelligence, machine learning