AI Summary
This work addresses the challenge that large language models struggle to effectively integrate cross-document evidence in multi-hop question answering. To overcome this limitation, the authors propose PAR²-RAG, a two-stage framework that first constructs a high-recall evidence frontier through breadth-first retrieval and then iteratively refines this evidence via depth-first reasoning while dynamically assessing sufficiency. By decoupling retrieval coverage from reasoning decisions, the approach simultaneously achieves high recall and adaptive inference, thereby avoiding premature commitment to low-recall retrieval paths and mitigating the drawbacks of static query formulations. Evaluated on four multi-hop QA benchmarks, PAR²-RAG substantially outperforms existing methods, yielding up to a 23.5% absolute improvement in accuracy over IRCoT and a 10.5% gain in NDCG retrieval metrics.
Abstract
Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms state-of-the-art baselines: compared with IRCoT, it achieves up to \textbf{23.5\%} higher accuracy, with retrieval gains of up to \textbf{10.5\%} in NDCG.
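The two-stage control flow described above can be sketched in a few lines. This is a minimal illustrative toy, not the authors' implementation: all function names, the keyword-lookup "retriever", and the toy corpus are assumptions standing in for the LLM-driven planning, retrieval, and sufficiency-judgment components the paper actually uses.

```python
# Toy sketch of a coverage-then-commitment retrieval loop, in the spirit of
# PAR^2-RAG's two stages. Every name here is a hypothetical placeholder.

# Tiny stand-in corpus: maps a query to a set of evidence sentences.
TOY_CORPUS = {
    "capital of France": {"Paris is the capital of France."},
    "river of Paris": {"The Seine flows through Paris."},
}

def retrieve(query):
    """Keyword lookup standing in for a real sparse/dense retriever."""
    return TOY_CORPUS.get(query, set())

def breadth_first_anchor(question):
    """Stage 1: issue several seed sub-queries up front to build a
    high-recall evidence frontier before committing to a reasoning path.
    (Seed queries would normally be planned by an LLM.)"""
    seed_queries = ["capital of France"]
    evidence = set()
    for q in seed_queries:
        evidence |= retrieve(q)
    return evidence

def is_sufficient(question, evidence):
    """Sufficiency control: stop once evidence covers both hops.
    Here a hard-coded toy check; normally an LLM judgment."""
    text = " ".join(evidence)
    return "Seine" in text and "Paris" in text

def refine_queries(question, evidence):
    """Stage 2: derive the next query from intermediate evidence,
    so the query set adapts as evidence changes."""
    return ["river of Paris"] if any("Paris" in e for e in evidence) else []

def par2_rag_loop(question, max_iters=3):
    evidence = breadth_first_anchor(question)      # coverage
    for _ in range(max_iters):                     # iterative commitment
        if is_sufficient(question, evidence):
            break
        for q in refine_queries(question, evidence):
            evidence |= retrieve(q)
    return evidence

evidence = par2_rag_loop("Which river flows through the capital of France?")
print(sorted(evidence))
```

The key design point the sketch mirrors is the decoupling: the anchoring stage only widens coverage, while the refinement loop is the only place that adapts queries and decides, via the sufficiency check, when to stop.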