PORTool: Tool-Use LLM Training with Rewarded Tree

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based tool-use approaches are trained on static datasets, leading to overfitting to fixed tool-call patterns and poor generalization to dynamic, evolving environments. To address this, we propose a reinforcement learning framework specifically designed for dynamic tool invocation. Our method constructs tree-structured trajectories via multi-rollout sampling to explicitly model multi-path decision-making. We introduce a fine-grained reward mechanism grounded in shared execution steps and a fork-relative advantage function that jointly incorporates step-level and trajectory-level advantages for policy optimization. Evaluated on complex queries spanning 17 diverse tools, our approach achieves substantial improvements in accuracy while reducing the average number of tool invocations per query. These results demonstrate enhanced exploration efficiency, robustness, and strong generalization capability in dynamic tool-use settings.

📝 Abstract
Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in evolving, dynamic tool-call environments. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and reductions in the number of tool-call steps.
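The abstract's rollout-to-tree grouping, where rollouts that share their first few tool-call steps collapse into common nodes, can be sketched as below. The step encoding (hashable step tokens) and the data structures are illustrative assumptions, not the paper's implementation:

```python
def build_prefix_tree(rollouts):
    """Group rollouts that share leading tool-call steps into a tree.

    Each rollout is a list of hashable tool-call steps. Shared prefixes
    collapse into single tree nodes, so a step common to several
    trajectories is represented once and can receive one shared reward.
    Returns the tree edges and, for each rollout, its node path.
    """
    children = {}   # (parent_node, step) -> child_node
    next_id = 1     # node 0 is the root
    paths = []
    for rollout in rollouts:
        node = 0
        path = [0]
        for step in rollout:
            key = (node, step)
            if key not in children:
                children[key] = next_id
                next_id += 1
            node = children[key]
            path.append(node)
        paths.append(path)
    return children, paths
```

Two rollouts that both begin with, say, a search step followed by a read step traverse the same two nodes and only fork at their third step, which is exactly the structure the step-wise reward assignment relies on.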
Problem

Research questions and friction points this paper is trying to address.

Addresses limited exploration in static tool-use LLM training
Proposes reinforcement learning for dynamic tool-call trajectory optimization
Enhances accuracy and efficiency in multi-step tool-integrated reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for tool-use LLM training
Tree-structured rollouts with step-wise rewards
Fork-relative advantages blended with trajectory advantages
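The last bullet, blending fork-relative and trajectory-relative advantages, can be sketched as follows. The z-score normalization, the linear blending weight `alpha`, and the index pairing of branches to trajectories are illustrative assumptions; the paper's exact weighting scheme may differ:

```python
import statistics

def blended_advantages(sibling_rewards, traj_rewards, alpha=0.5):
    """Blend fork-relative and trajectory-relative advantages.

    sibling_rewards: step rewards of the branches under one fork.
    traj_rewards: total rewards of the rollouts in the group.
    alpha: blending weight (hypothetical, not from the paper).
    Assumes one branch per trajectory at this fork, paired by index.
    """
    def normalize(xs):
        mu = statistics.mean(xs)
        sd = statistics.pstdev(xs) or 1.0  # avoid divide-by-zero
        return [(x - mu) / sd for x in xs]

    step_adv = normalize(sibling_rewards)   # fork-relative
    traj_adv = normalize(traj_rewards)      # trajectory-relative
    return [alpha * s + (1 - alpha) * t
            for s, t in zip(step_adv, traj_adv)]
```

When all branches under a fork earn the same step reward, the fork-relative term vanishes and the trajectory-relative term alone drives the update, which matches the intuition that shared steps carry a shared signal.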
Authors:
Feijie Wu, Purdue University (Federated Learning, Optimization, Large Language Model, LLM-based Multi-agents)
Weiwu Zhu, Apple
Yuxiang Zhang, Apple
Soumya Chatterjee, Apple
Jiarong Zhu, Apple
Fan Mo, Apple
Rodin Luo, Apple
Jing Gao, Purdue University