MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG evaluation frameworks predominantly focus on single-source, short-answer scenarios and fail to adequately assess a system's ability to integrate information from multiple sources into coherent long-form text.
Method: The authors propose MSRS, the first scalable evaluation framework for Multi-Source Retrieval and Synthesis, introducing two new benchmarks (MSRS-Story and MSRS-Meet) to systematically evaluate cross-document information fusion and long-form answer coherence in RAG. Their experiments combine sparse and dense retrieval and use oracle retrieval to decouple retrieval quality from generative capability.
Contribution/Results: Empirical results show that answer quality depends heavily on retrieval effectiveness, and that reasoning-oriented LLMs significantly outperform standard LLMs on multi-source synthesis, even under ideal (oracle) retrieval conditions, confirming the robustness of this advantage. The work establishes novel benchmarks and provides insights for advancing RAG evaluation paradigms.
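
The oracle-retrieval idea above can be sketched minimally: score retrieved context against a gold (oracle) context, so that remaining generation errors in the oracle setting cannot be blamed on retrieval. Everything here is illustrative and hypothetical — the function names and the toy term-overlap scorer (a crude stand-in for BM25 or a dense retriever) are not the paper's actual pipeline.

```python
from collections import Counter

def sparse_score(query: str, doc: str) -> int:
    """Toy sparse relevance: shared-term count (a BM25 stand-in)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank the corpus by the toy score and return the top-k documents."""
    return sorted(corpus, key=lambda doc: sparse_score(query, doc), reverse=True)[:k]

def evaluate(query: str, corpus: list[str], gold_docs: list[str], k: int = 2) -> dict:
    """Compare retrieved context with the oracle (gold) context.

    In the oracle setting the generator is fed gold_docs directly, so any
    remaining quality gap is attributable to the generation step alone.
    """
    retrieved = retrieve(query, corpus, k)
    recall = len(set(retrieved) & set(gold_docs)) / len(gold_docs)
    return {"retrieved": retrieved, "oracle": gold_docs, "recall": recall}
```

A multi-source query needs `recall` near 1.0 across several documents before the synthesis step even has the right material, which is the dependence the paper's experiments measure.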

📝 Abstract
Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-source retrieval-augmented generation systems
Integrating scattered information across multiple sources
Generating long-form responses through synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-source retrieval framework
Long-form response synthesis
Scalable evaluation benchmarks
Authors
Rohan Phanse (Yale University)
Yijie Zhou (The Chinese University of Hong Kong, Shenzhen)
Kejian Shi (Yale University)
Wencai Zhang (Yale University)
Yixin Liu (Yale University)
Yilun Zhao (Yale University)
Arman Cohan (Yale University; Allen Institute for AI)