How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of large language models in multi-hop reasoning and in tasks requiring up-to-date knowledge, where performance is often confounded by memorization of pretraining data. To disentangle genuine retrieval-augmented reasoning from memorization effects, we introduce HybridRAG-Bench, a multi-hop question answering benchmark that supports hybrid knowledge sources (unstructured text and structured knowledge graphs), enforces temporal control, and incorporates contamination awareness. The framework automatically aligns heterogeneous knowledge representations and generates explicit, traceable reasoning paths grounded in scientific literature, while enabling customization across domains and timeframes. Experiments in artificial intelligence, governance policy, and bioinformatics demonstrate its effectiveness in evaluating genuine retrieval and reasoning rather than parametric recall. The code and dataset are publicly released.

πŸ“ Abstract
Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.
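The pipeline the abstract describes can be illustrated with a minimal sketch: a hybrid knowledge base of dated KG triples, a temporal filter that drops anything a model could have seen during pretraining (contamination awareness), and a generator that chains triples into two-hop questions with an explicit reasoning path. All class names, relations, and the question template below are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Triple:
    """One structured fact extracted from a paper, tagged with its publication date."""
    head: str
    relation: str
    tail: str
    source_date: date

def temporal_filter(triples, cutoff):
    """Keep only facts published after the model's training cutoff (contamination awareness)."""
    return [t for t in triples if t.source_date > cutoff]

def two_hop_paths(triples):
    """Chain triples whose tail matches another triple's head: A -r1-> B -r2-> C."""
    by_head = {}
    for t in triples:
        by_head.setdefault(t.head, []).append(t)
    paths = []
    for t1 in triples:
        for t2 in by_head.get(t1.tail, []):
            if t2.tail != t1.head:  # avoid trivial A -> B -> A cycles
                paths.append((t1, t2))
    return paths

def make_qa(path):
    """Turn a two-hop path into a QA pair grounded in an explicit reasoning path."""
    t1, t2 = path
    return {
        "question": f"What does the entity that {t1.head} {t1.relation} in turn {t2.relation}?",
        "answer": t2.tail,
        "reasoning_path": [
            f"{t1.head} -{t1.relation}-> {t1.tail}",
            f"{t2.head} -{t2.relation}-> {t2.tail}",
        ],
    }

triples = [
    Triple("PaperX", "introduces", "MethodY", date(2025, 6, 1)),
    Triple("MethodY", "outperforms", "BaselineZ", date(2025, 7, 1)),
    Triple("OldPaper", "introduces", "OldMethod", date(2020, 1, 1)),
]
recent = temporal_filter(triples, cutoff=date(2024, 1, 1))
qa_pairs = [make_qa(p) for p in two_hop_paths(recent)]
```

In the real framework each hop is additionally aligned with an unstructured text passage, so a retrieval system must combine both representations; the sketch keeps only the structured side for brevity.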
Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented models
multi-hop reasoning
hybrid knowledge
benchmark contamination
parametric recall
Innovation

Methods, ideas, or system contributions that make the work stand out.

HybridRAG-Bench
multi-hop reasoning
hybrid knowledge
retrieval-augmented generation
contamination-aware evaluation