Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead and inefficient resource use of retrieval-augmented reasoning models, this paper proposes a cost-aware framework with adaptive retrieval depth control. Methodologically, it introduces a dynamic retrieval depth selection mechanism conditioned on both the query and the retrieved results; designs a cost-aware advantage function with memory- and latency-bound variants; and trains the model end-to-end with reinforcement learning, using both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Experiments across seven public question-answering benchmarks demonstrate that the approach reduces average model latency by 16–20% while improving exact match by approximately 5%, significantly enhancing the efficiency-effectiveness trade-off of retrieval-augmented reasoning systems.

📝 Abstract
Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains without compromising effectiveness: model latency decreases by ~16–20% across datasets, while effectiveness increases by ~5% on average in terms of exact match.
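The cost-aware advantage function described above can be sketched as follows. This is a minimal illustration, assuming a GRPO-style group-relative baseline where each rollout's task reward is penalized by its normalized latency; the function name, the linear penalty form, and the `cost_weight` parameter are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a cost-aware, group-relative advantage:
# penalize each sampled rollout's reward by its normalized latency,
# then normalize scores within the group (mean-subtracted, std-scaled).
from statistics import mean, pstdev

def cost_aware_advantages(rewards, latencies, cost_weight=0.1):
    """Combine task reward with a latency penalty, then compute
    group-relative advantages over the sampled rollouts."""
    max_lat = max(latencies) or 1.0  # avoid division by zero
    scores = [r - cost_weight * (lat / max_lat)
              for r, lat in zip(rewards, latencies)]
    mu, sigma = mean(scores), pstdev(scores) or 1.0
    return [(s - mu) / sigma for s in scores]
```

Under this penalty, two rollouts with equal task reward are ranked by latency: the faster one receives the higher advantage, which is what steers the policy toward shallower retrieval when extra documents do not improve the answer.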
Problem

Research questions and friction points this paper addresses.

Retrieval-augmented reasoning incurs high computational costs from both retrieval and reasoning tokens
A fixed retrieved-document list length wastes resources on queries that need less evidence
Efficiency improvements must not compromise answer effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically adjusts retrieval depth based on the query and retrieval results
Trains with a cost-aware advantage function via reinforcement learning (PPO and GRPO)
Provides both memory- and latency-bound implementations of the framework
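To make the adaptive-depth idea concrete, here is a minimal sketch of one way a retrieval depth could be chosen from ranked retrieval scores. The score-ratio thresholding rule, the function name, and all parameters (`min_k`, `max_k`, `ratio`) are assumptions for illustration only; the paper instead learns the depth decision with reinforcement learning.

```python
# Illustrative heuristic for adaptive retrieval depth: keep top-ranked
# documents whose score stays within a fraction of the best score,
# bounded by [min_k, max_k]. This is NOT the paper's learned policy.
def select_depth(scores, min_k=1, max_k=10, ratio=0.5):
    """Return how many top-ranked documents to keep, given
    retrieval scores sorted in descending order."""
    if not scores:
        return 0
    threshold = ratio * scores[0]
    k = sum(1 for s in scores[:max_k] if s >= threshold)
    return max(min_k, k)
```

A query with a sharply decaying score profile would keep only a few documents, while a flat profile keeps more, mirroring the intuition that retrieval depth should depend on the query and the retrieved results.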
Helia Hashemi
Senior Researcher, Microsoft
Information Retrieval · Natural Language Processing · Machine Learning
Victor Rühle
Microsoft, Cambridge, United Kingdom
Saravan Rajmohan
Microsoft, Redmond, WA, United States