RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational and token costs of multi-hop question answering systems, which invoke expensive retrieval operations and large language model (LLM) calls on every question and are therefore impractical in budget-constrained settings. The authors propose a lightweight routing mechanism that uses the result of a single one-shot RAG pass and six low-overhead features derived from it to decide whether to escalate to a more expensive retrieval strategy, such as PRUNE or IRCoT, and so avoids redundant computation. The resulting routers, RASER-2 and RASER-3, require no additional LLM invocations for the routing decision and achieve F1 scores competitive with state-of-the-art baselines across six LLMs and three multi-hop QA benchmarks, while using only 41–49% of the tokens consumed by the always-prune baseline. RASER-3 additionally exposes an explicit, controllable trade-off between cost and accuracy.

📝 Abstract

Multi-hop question-answering systems often run expensive retrieval on every question: they may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which adds token cost and makes them a poor fit when the LLM budget is tight. However, our analysis shows that many multi-hop questions are already answered correctly by one-shot RAG, so running extra retrieval on every question wastes budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features derived from it. RASER-2 decides whether to stop or escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and the iterative retrieval method IRCoT, using the same features but adding an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive in F1 with state-of-the-art (SOTA) baselines while spending only 41-49% of always-prune's tokens, and fewer tokens than the iterative and decomposition retrieval baselines.
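The two routing decisions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the six features or how they are scored, so the sketch takes per-question estimates as inputs. The names `recoverability`, `threshold`, `ActionEstimate`, `expected_f1`, `token_cost`, and the trade-off weight `lam` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# Actions named in the abstract, ordered from cheapest to most expensive.
ACTIONS = ("one_shot_rag", "prune", "ircot")


@dataclass
class ActionEstimate:
    """Hypothetical per-question estimates for one action (not from the paper)."""
    expected_f1: float  # assumed predicted answer quality for this action
    token_cost: float   # assumed relative token cost of this action


def route_two_way(recoverability: float, threshold: float = 0.5) -> str:
    """RASER-2-style choice: stop at one-shot RAG or escalate to PRUNE.

    `recoverability` is an assumed score in [0, 1] estimating how likely
    escalation is to fix a wrong one-shot answer.
    """
    return "prune" if recoverability > threshold else "one_shot_rag"


def route_three_way(estimates: Dict[str, ActionEstimate], lam: float) -> str:
    """RASER-3-style choice with an explicit cost-accuracy trade-off.

    Picks the action maximizing expected_f1 - lam * token_cost, so a larger
    `lam` favors cheaper actions. This utility form is an assumption.
    """
    return max(
        ACTIONS,
        key=lambda a: estimates[a].expected_f1 - lam * estimates[a].token_cost,
    )


if __name__ == "__main__":
    # Made-up numbers for a single question, for illustration only.
    example = {
        "one_shot_rag": ActionEstimate(expected_f1=0.60, token_cost=1.0),
        "prune": ActionEstimate(expected_f1=0.70, token_cost=2.5),
        "ircot": ActionEstimate(expected_f1=0.74, token_cost=4.0),
    }
    for lam in (0.0, 0.05, 0.1):
        print(lam, route_three_way(example, lam))
```

In this sketch neither function calls an LLM: both operate only on numbers computed after the one-shot RAG pass, which mirrors the abstract's claim that routing adds no extra LLM call. Raising `lam` moves the choice toward cheaper actions.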

Problem

Research questions and friction points this paper is trying to address.

multi-hop question answering

retrieval cost

LLM budget

token efficiency

recoverability

Innovation

Methods, ideas, or system contributions that make the work stand out.

RASER

multi-hop QA

retrieval routing