Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a systematic deficiency in large language models (LLMs) for locating short, critical information—termed “fine needles”—within long-context question answering: model accuracy degrades and positional bias intensifies as the length of the gold-context segment decreases. Method: The paper conducts the first systematic investigation into how gold-context size affects LLM retrieval capability, evaluating seven mainstream LLMs—spanning diverse scales and architectures—across general knowledge, biomedical, and mathematical reasoning domains using an enhanced needle-in-a-haystack benchmark. Contribution/Results: Results consistently show that shorter gold contexts not only reduce overall accuracy but also exacerbate positional sensitivity, undermining assumptions underlying current long-context evaluation paradigms. This finding provides critical empirical evidence and methodological caution for designing AI agent systems that require fine-grained, distributed information integration.

📝 Abstract
Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with small relevant contexts in large irrelevant data
Gold context length variation impacts LLM question answering performance
Smaller gold contexts degrade performance and increase positional sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studies gold context length impact on LLMs
Reveals performance drop with shorter contexts
Tests seven LLMs across diverse domains
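The core experimental manipulation described above—placing a gold-context segment of varying length at varying positions among distractor passages—can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, prompt layout, and example passages are assumptions for demonstration.

```python
# Illustrative sketch (not the paper's implementation): building a
# needle-in-a-haystack prompt where the gold segment's length and
# position are the experimental variables.

def build_haystack_prompt(gold_context: str, distractors: list[str],
                          position: int, question: str) -> str:
    """Insert the gold segment among distractor passages at the given
    index, then append the question."""
    passages = distractors[:position] + [gold_context] + distractors[position:]
    context = "\n\n".join(passages)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Two gold segments carrying the same fact at different lengths,
# mirroring the paper's short-vs-long gold-context comparison:
short_gold = "Ada Lovelace wrote the first algorithm."
long_gold = ("Ada Lovelace, working with Charles Babbage on the "
             "Analytical Engine in the 1840s, wrote what is widely "
             "regarded as the first algorithm intended for a machine.")
distractors = [f"Filler passage {i} about an unrelated topic." for i in range(5)]

prompt = build_haystack_prompt(short_gold, distractors, position=2,
                               question="Who wrote the first algorithm?")
```

Sweeping `position` across the haystack while holding the gold fact constant is what exposes the positional sensitivity the paper reports; repeating the sweep with `short_gold` versus `long_gold` isolates the gold-context-size effect.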
Owen Bianchi
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Mathew J. Koretsky
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Maya Willey
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Chelsea X. Alvarado
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Tanay Nayak
DataTecnica LLC; Johns Hopkins University
Nicole Kuznetsov
Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC
Mike A. Nalls
Founder/consultant with Data Tecnica International; data science lead at NIH’s Center for Alzheimer’s Disease and Related Dementias
Statistical genetics, neurodegeneration, data science, biostatistics, genomics
Faraz Faghri
National Institutes of Health
Computer science, neuroscience, health, aging, complex diseases
Daniel Khashabi
Johns Hopkins University
Natural language processing, artificial intelligence, machine learning