Scaling Autonomous Agents via Automatic Reward Modeling And Planning

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face critical bottlenecks in multi-step decision-making and environment-interaction tasks, such as online shopping and scientific reasoning: sparse environmental feedback, scarce labeled decision-making data, and the infeasibility of fine-tuning black-box API-based models. To address these challenges, we propose an automated reward-modeling and planning framework that requires no human annotation. An exploratory agent first samples diverse action trajectories at random; a second LLM then assigns a task intent to each trajectory and synthesizes a negative response alongside the correct one, yielding (intent, positive response, negative response) triplets for training. A reward model trained contrastively on these triplets converts sparse environmental feedback into learnable trajectory scores, which serve as heuristics for task planning. Crucially, the approach eliminates reliance on API fine-tuning and manual labeling. Evaluations across multiple agent benchmarks show significant improvements in task completion rate and planning robustness, demonstrating the framework's generalizability and data efficiency in complex, real-world scenarios.
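The contrastive training step above can be sketched as a pairwise ranking loss: the reward model should score the positive response of a triplet above the negative one. This is a minimal illustrative sketch, not the paper's implementation; `toy_reward` is a hypothetical stand-in scorer (keyword overlap with the intent) in place of a learned neural reward model.

```python
import math

def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry-style pairwise loss: -log(sigmoid(score_pos - score_neg)).
    It is small when the positive trajectory outranks the negative, and large
    otherwise, pushing the reward model to separate the two."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))

def toy_reward(intent: str, trajectory: str) -> float:
    """Hypothetical stand-in for a learned reward model: scores a trajectory
    by the fraction of intent words it covers."""
    intent_words = set(intent.lower().split())
    traj_words = set(trajectory.lower().split())
    return len(intent_words & traj_words) / max(len(intent_words), 1)

# One synthetic (intent, positive, negative) triplet, as in the framework.
triplet = {
    "intent": "buy a red umbrella",
    "positive": "search red umbrella add to cart checkout",
    "negative": "search blue shoes add to cart checkout",
}
s_pos = toy_reward(triplet["intent"], triplet["positive"])
s_neg = toy_reward(triplet["intent"], triplet["negative"])
print(pairwise_loss(s_pos, s_neg))
```

In the paper's setting the scorer would be a trainable model optimized to drive this loss down over many synthesized triplets; the toy scorer here only shows the shape of the objective.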

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.

Problem

Research questions and friction points this paper is trying to address.

Enhance LLM decision-making with automated reward models
Overcome data scarcity in multi-step decision tasks
Optimize LLM agents for complex interactive environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates reward model learning from environment
Uses LLM-generated trajectories for training
Enhances decision-making in complex scenarios
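The data-collection side of these contributions, random exploration followed by triplet synthesis, can be sketched end to end. This is a toy environment with a fixed action list; in the paper both the exploring agent and the triplet synthesizer are LLMs, so `explore` and `synthesize_triplet` here are hypothetical stand-ins that only mirror the pipeline's structure.

```python
import random

ACTIONS = ["search", "click", "add_to_cart", "checkout", "back"]

def explore(env_actions, max_steps=4, seed=None):
    """Exploratory agent: sample a random action trajectory
    (stand-in for an LLM agent navigating a real environment)."""
    rng = random.Random(seed)
    return [rng.choice(env_actions) for _ in range(rng.randint(1, max_steps))]

def synthesize_triplet(trajectory):
    """Stand-in for the second LLM: assign a plausible intent to the
    trajectory and fabricate a mismatched negative for contrastive training."""
    intent = "complete a purchase" if "checkout" in trajectory else "browse items"
    negative = [a for a in trajectory if a != "checkout"] or ["back"]
    return {"intent": intent, "positive": trajectory, "negative": negative}

# Collect a small synthetic training set without any human annotation.
dataset = [synthesize_triplet(explore(ACTIONS, seed=i)) for i in range(5)]
for t in dataset:
    print(t["intent"], "|", t["positive"], "|", t["negative"])
```

The resulting triplets are exactly the training inputs the reward model consumes; no environment rewards or human labels are needed at any point.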