🤖 AI Summary
This work addresses the training instability and insufficient exploration that often arise when large language models are trained with reinforcement learning over external tools, caused either by over-reliance on tools or by overly conservative tool use. To mitigate these issues, the authors propose TAO-RL, a framework that combines tool-aware trajectory filtering with an entropy-guided exploration strategy. The filtering step discards trajectories in which every tool invocation fails to execute, as well as rollout groups that are uniformly correct or uniformly incorrect, while an entropy-guided bonus reshapes the advantage function at tokens following tool calls to increase diversity at critical decision points. Experiments on seven complex reasoning benchmarks and three model scales show that TAO-RL consistently outperforms existing methods.
📝 Abstract
Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, while overly conservative tool use limits effective exploration. To address this issue, we propose TAO-RL, a unified framework that couples tool-aware trajectory filtering with entropy-guided exploration for efficient policy optimization. Specifically, at the data level, TAO-RL filters rollouts along two criteria: it discards trajectories in which every tool invocation fails to execute, and it removes rollout groups in which all rollouts are correct or all are incorrect, since both cases yield degenerate advantage estimates that contribute no discriminative learning signal. This joint filtering retains data that are both tool-capable and informative, establishing a high-quality training distribution. At the algorithmic level, we introduce a tool-aware entropy-guided bonus that reshapes the advantage function at post-tool-call tokens, encouraging the policy to explore more diverse reasoning paths at critical decision points. The two components are mutually reinforcing: trajectory filtering establishes a clean and informative training foundation, while entropy-guided exploration drives stronger reasoning behaviors at critical tool-interaction junctures. Extensive experiments on 7 challenging reasoning benchmarks across 3 model scales demonstrate the superiority of TAO-RL over existing methods.
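The two components described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: a binary correctness reward, a group-normalized (mean/standard deviation) advantage, an additive bonus of the form `alpha * entropy`, keeping rollouts that make no tool calls, and applying the tool filter before the correctness check. All names (`Rollout`, `filter_group`, `shaped_token_advantages`, `alpha`) are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Rollout:
    reward: float                # assumed binary: 1.0 if the final answer is correct, else 0.0
    tool_call_ok: List[bool]     # execution status of each tool invocation
    token_entropy: List[float]   # policy entropy at each generated token
    post_tool_mask: List[bool]   # True for tokens generated right after a tool result


def all_tools_failed(r: Rollout) -> bool:
    # A rollout with no tool calls is kept here; the abstract does not cover that case.
    return len(r.tool_call_ok) > 0 and not any(r.tool_call_ok)


def filter_group(group: List[Rollout]) -> Optional[List[Rollout]]:
    """Apply both data-level criteria to the rollouts sampled for one prompt."""
    usable = [r for r in group if not all_tools_failed(r)]
    # All-correct or all-incorrect groups carry no discriminative signal.
    if len({r.reward for r in usable}) < 2:
        return None
    return usable


def group_advantages(group: List[Rollout], eps: float = 1e-6) -> List[float]:
    # Assumed group-normalized advantage; identical rewards would all map to 0.
    rewards = [r.reward for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


def shaped_token_advantages(r: Rollout, adv: float, alpha: float = 0.1) -> List[float]:
    # Hypothetical bonus: add alpha * entropy only at post-tool-call tokens.
    return [
        adv + alpha * h if is_post_tool else adv
        for h, is_post_tool in zip(r.token_entropy, r.post_tool_mask)
    ]
```

Under these assumptions, a group whose rewards are identical is dropped before any advantage is computed, and the entropy term affects only the tokens flagged in `post_tool_mask`, leaving the rest of the trajectory with the unmodified sequence-level advantage.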