When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses whether iterative retrieval-reasoning can surpass static retrieval-augmented generation (RAG) that provides all ideal evidence at once in scientific multi-hop question answering. Through controlled experiments, the authors systematically compare three paradigms—context-free, ideal-evidence static RAG, and iterative RAG—within the chemistry domain, revealing both the efficacy and failure modes of iterative mechanisms. They propose a training-free controller that orchestrates alternating retrieval, hypothesis refinement, and an evidence-aware stopping strategy, and introduce multidimensional diagnostic metrics such as coverage gaps and anchor loss. The work demonstrates, for the first time at the mechanistic level, that iterative RAG can exceed the performance ceiling imposed by ideal evidence, achieving up to a 25.6 percentage point gain on ChemKGMultiHopQA, with particularly pronounced improvements for non-reasoning-finetuned models.

📝 Abstract
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
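The abstract describes the iterative regime as a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. A minimal sketch of that loop, purely illustrative: `retrieve`, `refine_hypothesis`, and `evidence_sufficient` are assumed interfaces standing in for components the paper does not spell out here, not the authors' implementation.

```python
# Hypothetical sketch of a training-free iterative RAG controller:
# alternate retrieval and hypothesis refinement, stopping early when an
# evidence-aware check judges the accumulated evidence sufficient.
def iterative_rag(question, retrieve, refine_hypothesis,
                  evidence_sufficient, max_hops=4):
    evidence = []
    hypothesis = None
    for hop in range(max_hops):
        # Later hops query from the current hypothesis, so staged retrieval
        # can target the next reasoning step instead of the full question.
        query = hypothesis or question
        evidence.extend(retrieve(query))
        hypothesis = refine_hypothesis(question, evidence)
        if evidence_sufficient(question, evidence, hypothesis):
            break  # evidence-aware early stop
    return hypothesis, evidence
```

The staged queries are what the study credits for reducing late-hop failures and context overload, while the stopping check is where the "early stopping miscalibration" failure mode would arise.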
Problem

Research questions and friction points this paper is trying to address.

Iterative RAG
Scientific Multi-hop Question Answering
Gold Context
Retrieval-Augmented Generation
Evidence Composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative RAG
multi-hop reasoning
retrieval-augmented generation
scientific question answering
diagnostic study
Mahdi Astaraki
Department of Computational Science and Engineering, McMaster University, Canada; BASF Canada Inc., Canada
Mohammad Arshi Saloot
BASF Canada Inc., Canada
Ali Shiraee Kasmaee
BASF Canada Inc., Canada
H. Mahyar
Department of Computational Science and Engineering, McMaster University, Canada
Soheila Samiee
Senior Applied Research Scientist, BASF
Large Language Models · Tabular deep learning · Machine Learning · Time-series analysis · Neuroscience