Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Training large language models for multi-step tool use faces three challenges: realistic stateful environments are costly to build, synthetic queries are often misaligned with the server's actual state (so the resulting tool calls fail), and recall-based rewards encourage redundant tool calls. To address these, this work proposes PROVE, a framework that combines live MCP servers with session-scoped state isolation, dependency-graph-guided trajectory synthesis grounded in sampled server state, and a multi-component programmatic reward (validity, coverage, and an efficiency penalty, among other terms) that needs no external judge model. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE improves scores by up to 10.2, 6.8, and 6.5 points, respectively, with consistent gains across the four models tested from two model families.

📝 Abstract

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates validated multi-turn tool-call trajectories against these servers via dependency-graph-guided conversation simulation grounded in live-sampled server state, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward - graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus - requiring no external judge model. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using identical reward hyperparameters and ~13K training examples; only learning rate is tuned per model family from a three-point sweep. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that a compact programmatic reward yields consistent gains on multi-step tool orchestration across two model families.
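
To make the reward idea in the abstract concrete, here is a minimal Python sketch of a programmatic reward that combines validity, coverage, and an efficiency penalty with a call budget scaled to task complexity. It is an illustration under stated assumptions, not the paper's implementation: the `ToolCall` type, the `toy_reward` function, the weights, and the `slack` factor are all hypothetical, and the sketch omits the dependency-aware coverage, tool-name signal, and argument-value matching bonus that the abstract lists.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    ok: bool = True  # assumed flag: the live server executed the call without error


def toy_reward(pred, ref, w_valid=0.4, w_cov=0.6, w_eff=0.3, slack=1.5):
    """Hypothetical composite reward; all weights and the slack factor are illustrative."""
    if not pred:
        return 0.0
    # Validity: fraction of predicted calls that executed successfully.
    validity = sum(c.ok for c in pred) / len(pred)
    # Coverage (plain recall here): share of reference tools called successfully.
    ref_names = {c.name for c in ref}
    hit = {c.name for c in pred if c.ok} & ref_names
    coverage = len(hit) / len(ref_names) if ref_names else 1.0
    # Efficiency: the call budget grows with reference length; calls beyond it are penalized.
    budget = max(1, math.ceil(slack * len(ref)))
    excess = max(0, len(pred) - budget)
    penalty = excess / len(pred)
    return w_valid * validity + w_cov * coverage - w_eff * penalty


ref = [ToolCall("search_flights"), ToolCall("book_flight")]
concise = [ToolCall("search_flights"), ToolCall("book_flight")]
verbose = concise + [ToolCall("search_flights") for _ in range(4)]
print(toy_reward(concise, ref), toy_reward(verbose, ref))
```

In this toy setup both trajectories reach full coverage, so a recall-only reward could not tell them apart; only the budget-based penalty scores the six-call trajectory below the two-call one.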

Problem

Research questions and friction points this paper is trying to address.

multi-step tool use

reinforcement learning

stateful execution environments

synthetic data generation

tool-call orchestration

Innovation

Methods, ideas, or system contributions that make the work stand out.

programmatic reward

stateful tool execution

dependency-graph-guided synthesis