AI Summary
Current evaluations of tool-use capabilities lack standardization, and reinforcement learning-based training suffers from low computational efficiency. This work systematically analyzes key factors in the evaluation pipeline that are highly sensitive to implementation details, revealing how the absence of consistent benchmarks renders leaderboards unreliable in multi-turn settings. To address these issues, the study introduces two novel acceleration techniques for reinforcement learning training that integrate prompt engineering with policy optimization, significantly reducing wall-clock training time without compromising tool-use performance. By jointly enhancing evaluation reliability and training efficiency, this research establishes a more effective and reproducible optimization pathway for developing tool-augmented agents.
Abstract
Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, the system prompt, multi-turn template construction, and how prior interaction and reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings, where leaderboard rankings are unreliable without rigorous standardization. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
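To make the first source of waste concrete, the sketch below shows one common way a prompt can yield no learning signal. It assumes a group-normalized advantage (as in GRPO-style training), which the abstract does not specify, so this illustrates the general phenomenon and is not the paper's method. The function names, prompt IDs, and reward values are all hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style estimator; the abstract does not name one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_learning_signal(rewards):
    """A prompt contributes a nonzero advantage only if its rollouts disagree."""
    return max(rewards) != min(rewards)

# Hypothetical rollout rewards (1.0 = correct tool call, 0.0 = incorrect).
rollouts = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # always solved: all advantages are zero
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # never solved: all advantages are zero
    "prompt_c": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: nonzero advantages
}

# Keep only prompts whose rollouts would produce a nonzero policy gradient.
informative = {p: r for p, r in rollouts.items() if has_learning_signal(r)}

for prompt, rewards in rollouts.items():
    print(prompt, group_advantages(rewards))
```

Under this assumption, rollouts spent on `prompt_a` and `prompt_b` consume generation compute but contribute nothing to the update, which is the kind of waste item (i) describes.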