Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
With the emergence of long-context language models (supporting 128K+ tokens), it remains unclear whether multi-stage retrieval-augmented generation (RAG) still outperforms simpler single-stage alternatives. Method: We propose DOS RAG—a minimalist, end-to-end “retrieve-and-read” baseline that strictly preserves the original paragraph order from retrieval results, without re-ranking, summarization, or graph construction, relying solely on modern embedding models and long-context LMs to directly process retrieved passages. Contribution/Results: On multi-hop, long-context QA benchmarks—including HotpotQA, MuSiQue, and 2WikiMultiHopQA—DOS RAG consistently matches or exceeds state-of-the-art multi-stage RAG methods such as ReadAgent and RAPTOR. This work provides the first systematic empirical validation of order-aware single-stage RAG, establishing DOS RAG as a new standard baseline for long-context RAG research.

📝 Abstract
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
Problem

Research questions and friction points this paper is trying to address.

Comparing multi-stage RAG pipelines with simpler single-stage approaches
Evaluating QA tasks under systematically scaled token budgets
Assessing trade-offs between complexity and effectiveness in RAG methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses long-context language models to directly read retrieved passages, without intermediate summarization or re-ranking
Compares multi-stage RAG pipelines with baselines
Recommends DOS RAG as a strong baseline
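The core DOS RAG idea described above, retrieve top-k passages by embedding similarity, then restore their original document order before concatenating them into the LM's context, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function here is a toy bag-of-words stand-in for the modern neural embedding models the paper relies on.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real DOS RAG system would use
    # a neural embedding model (this stand-in is purely illustrative).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dos_rag_context(query, passages, k=2):
    """Retrieve the top-k passages for the query, then re-sort the
    selected passages back into their original document order before
    concatenation -- the order-preserving step that defines DOS RAG."""
    q = embed(query)
    ranked = sorted(range(len(passages)),
                    key=lambda i: cosine(q, embed(passages[i])),
                    reverse=True)
    selected = sorted(ranked[:k])  # restore original passage order
    return "\n\n".join(passages[i] for i in selected)

passages = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
    "Bread is a staple food.",
]
ctx = dos_rag_context("When was the Eiffel Tower in Paris built?", passages)
print(ctx)
```

Note that even though the second passage scores higher for this query, it appears after the first in the assembled context, because DOS RAG orders selected passages by document position rather than by retrieval score.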
Alex Laitenberger
Stanford University, USA
Christopher D. Manning
Stanford University, USA
Nelson F. Liu
Stanford University, USA
Natural Language Processing