On the impact of retrieved content representations in RAG Pipelines

📅 2026-05-28
🤖 AI Summary
This study examines a gap in current retrieval-augmented generation (RAG) systems: most pipelines inherit retrieval components designed for human readers, and it is not well understood how retrieved content should be represented when the consumer is a large language model. Holding retrieval results fixed, the authors compare 14 document representations (the original baseline plus 13 transformations spanning selection, summarization, and reformulation, in query-dependent and query-independent variants) and measure question-answering accuracy across four generators. They also measure answer retention, that is, whether a known answer-bearing document still supports its answer after transformation. The controlled comparison finds that answer retention is the primary determinant of generation accuracy: when retention is high, a representation's wording, structure, length, and query dependence have limited effect. Accuracy gains that prior work attributed to specific representational mechanisms may therefore be partly explained by how well those mechanisms preserve answer-bearing content.
📝 Abstract
Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
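To make the answer retention measurement concrete, here is a minimal sketch under stated assumptions. The abstract does not say how "still supports its answer" is judged, so this example substitutes a simple normalized substring match as a proxy. The function names, the `(query, document, answers)` tuple layout, and the two toy transformations (`identity`, `first_sentence`) are all hypothetical and are not taken from the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def retains_answer(transformed_doc: str, answers: list) -> bool:
    """Proxy check (an assumption): a normalized gold answer appears in the text."""
    doc = normalize(transformed_doc)
    normalized = [normalize(a) for a in answers]
    return any(a and a in doc for a in normalized)


def answer_retention_rate(examples: list, transform) -> float:
    """Fraction of answer-bearing documents whose answer survives `transform`.

    `examples` holds (query, document, answers) tuples; `transform` takes
    (document, query) so that query-dependent variants fit the same interface.
    """
    if not examples:
        return 0.0
    kept = sum(
        retains_answer(transform(doc, query), answers)
        for query, doc, answers in examples
    )
    return kept / len(examples)


def identity(doc: str, query: str) -> str:
    """Baseline: the original document, unchanged."""
    return doc


def first_sentence(doc: str, query: str) -> str:
    """Toy query-independent selection: keep only the first sentence."""
    return doc.split(". ")[0]


# Toy data, invented for illustration only.
EXAMPLES = [
    (
        "Who wrote Hamlet?",
        "Hamlet is a tragedy. It was written by William Shakespeare.",
        ["William Shakespeare"],
    ),
    (
        "What is the capital of France?",
        "Paris is the capital of France. It lies on the Seine.",
        ["Paris"],
    ),
]

if __name__ == "__main__":
    for name, fn in [("identity", identity), ("first_sentence", first_sentence)]:
        print(name, answer_retention_rate(EXAMPLES, fn))
```

On this toy data the first-sentence selection drops the answer from one of the two documents, which is the kind of loss a retention measurement is meant to expose before any generator is run. A substring proxy will miss paraphrased answers and can match spurious mentions, so a real study would likely need a stricter notion of support.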
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
retrieved content representation
large language model
answer retention
document transformation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
content representation
answer retention
controlled comparison
large language models