Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

2026-06-01
Citations: 0
Influential: 0

AI Summary
This work addresses the tendency of reinforcement learning agents to over-rely on external tools even when a problem can be solved through internal reasoning, which leads to inefficiency and constrained exploration. To mitigate this, the authors propose the Efficient Agentic Policy Optimization (EAPO) framework, which dynamically adjusts tool-use penalties by jointly leveraging problem difficulty and model confidence signals. EAPO enables adaptive tool selection through tool-free trajectory sampling, difficulty-aware reward shaping, and confidence-driven token-level reweighting. Evaluated across nine reasoning benchmarks, EAPO achieves a superior accuracy-efficiency trade-off, outperforming GRPO by 7.27% to 10.45% in average performance while reducing tool invocations by 18.33% to 24.59%.
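To make the idea of a difficulty-dependent tool penalty concrete, here is a minimal Python sketch. It is not the authors' formulation: the function name `difficulty_aware_rewards`, the `penalty_scale` parameter, and the use of the tool-free success rate as an "easiness" proxy are all assumptions chosen for illustration. The only property taken from the text is that redundant tool calls are penalized mainly on easier queries.

```python
def difficulty_aware_rewards(correct, tool_calls, penalty_scale=0.1):
    """Shape rewards for one rollout group (hypothetical scheme, not from the paper).

    correct:    list of bool, whether each trajectory reached the right answer
    tool_calls: list of int, number of tool invocations in each trajectory
    """
    # Assumed easiness proxy: success rate of the tool-free trajectories.
    # If the model often succeeds without tools, the query is treated as easy.
    tool_free = [c for c, n in zip(correct, tool_calls) if n == 0]
    easiness = sum(tool_free) / len(tool_free) if tool_free else 0.0

    # The per-call penalty grows with easiness and vanishes on hard queries.
    per_call_penalty = penalty_scale * easiness
    return [float(c) - per_call_penalty * n for c, n in zip(correct, tool_calls)]


# Easy query: tool-free rollouts succeed, so tool calls are penalized.
easy = difficulty_aware_rewards([True, True, True, True], [0, 0, 2, 1])
# Hard query: tool-free rollouts fail, so tool calls cost nothing.
hard = difficulty_aware_rewards([False, False, True, True], [0, 0, 2, 1])
```

Under these assumptions, a correct answer obtained with tools on an easy query earns less than a correct tool-free answer, while on a hard query tool-assisted success keeps its full reward.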
Abstract
Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
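Since EAPO is compared against GRPO and operates on rollout groups, it may help to see how a group that mixes tool-free and tool-using trajectories turns rewards into learning signals. The sketch below shows the standard group-relative normalization used by GRPO-style methods (reward minus group mean, divided by group standard deviation). The function name, the `eps` constant, and the example reward values are illustrative assumptions, and the abstract does not state whether EAPO modifies this step.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every trajectory gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group for one query: a correct tool-free rollout (1.0),
# a correct rollout with a penalized tool call (0.8), and two failures (0.0).
advantages = group_relative_advantages([1.0, 0.8, 0.0, 0.0])
```

In this made-up group the correct tool-free trajectory receives the largest advantage, the correct tool-using one a smaller positive advantage, and the failures a negative one, which is the kind of ordering a selective tool-use objective would want to reinforce.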
Problem

Research questions and friction points this paper is trying to address.

tool abuse
agentic reinforcement learning
tool overuse
internal reasoning
external tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective tool use
difficulty-aware reward shaping
confidence-aware token reweighting
tool-free trajectories
agentic reinforcement learning