SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the computational inefficiency of existing web agents in complex information retrieval, which is often caused by redundant tool invocations and excessive reasoning steps. To mitigate this, the authors propose SlimSearcher, an efficiency-aware training framework that applies Pareto-efficient trajectory distillation during supervised fine-tuning (SFT) to filter out inefficient behaviors. In the subsequent reinforcement learning (RL) phase, they design an adaptive reward gating mechanism based on population-relative efficiency, which avoids the brevity bias and reward hacking induced by absolute penalties. Evaluated on long-horizon benchmarks including GAIA, BrowseComp, and XBenchDeepSearch, SlimSearcher reduces tool invocation rounds by 17%–58% while maintaining or even improving task accuracy.
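As a rough illustration of what Pareto-efficient trajectory filtering could look like, the sketch below keeps only successful trajectories that are not dominated on cost by another successful trajectory for the same task. The `Trajectory` fields, the two cost axes (tool calls and tokens), and the per-task grouping are assumptions made for this example; the summary does not specify the authors' exact criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record layout, chosen for illustration only.
    task_id: str
    correct: bool
    tool_calls: int
    tokens: int


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if a is no costlier than b on both axes and strictly cheaper on one."""
    no_worse = a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
    strictly_better = a.tool_calls < b.tool_calls or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_filter(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Keep correct trajectories that lie on the per-task cost Pareto front."""
    by_task: dict[str, list[Trajectory]] = {}
    for t in trajectories:
        if t.correct:  # assumption: failed trajectories are always discarded
            by_task.setdefault(t.task_id, []).append(t)

    kept: list[Trajectory] = []
    for group in by_task.values():
        for t in group:
            if not any(dominates(other, t) for other in group):
                kept.append(t)
    return kept
```

Under these assumptions, a trajectory that uses fewer tool calls but more tokens than another is kept alongside it, since neither dominates the other; only trajectories that are worse or equal on both axes, and strictly worse on at least one, are dropped.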

📝 Abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that go far beyond what is necessary to resolve these tasks and that lead to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%–58% while maintaining or improving accuracy.
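One possible reading of a cohort-relative reward with a strict correctness gate is sketched below. It is an assumption-laden illustration, not the paper's formulation: the function name `gated_rewards`, the bonus weights, the use of the cohort mean as the efficiency threshold, and the ordering of the cascade (tool efficiency checked before token efficiency) are all hypothetical.

```python
def gated_rewards(
    correct: list[bool],
    tool_calls: list[int],
    tokens: list[int],
    w_tool: float = 0.25,   # hypothetical bonus weight
    w_token: float = 0.25,  # hypothetical bonus weight
) -> list[float]:
    """Reward each rollout in a cohort sampled for the same prompt.

    Incorrect rollouts always get 0 (correctness gate). Correct rollouts get
    a base reward of 1, plus bonuses measured relative to the other correct
    rollouts in the cohort rather than against an absolute budget.
    """
    n = len(correct)
    rewards = [0.0] * n
    winners = [i for i in range(n) if correct[i]]
    if not winners:
        return rewards

    # Assumption: thresholds are the mean cost among correct rollouts.
    mean_tool = sum(tool_calls[i] for i in winners) / len(winners)
    mean_tok = sum(tokens[i] for i in winners) / len(winners)

    for i in winners:
        reward = 1.0
        if tool_calls[i] <= mean_tool:
            reward += w_tool
            # Cascade: the token bonus is only reachable once the tool gate passes.
            if tokens[i] <= mean_tok:
                reward += w_token
        rewards[i] = reward
    return rewards
```

Because the bonuses here are defined relative to the cohort and never reach incorrect rollouts, a short but wrong answer cannot outscore a longer correct one, which is the intuition behind avoiding brevity bias under absolute length penalties.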

Problem

Research questions and friction points this paper is trying to address.

training efficiency

web agents

tool usage

token consumption

computational cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Reward Gating

Pareto-efficient filtration

efficiency-aware training