Self-Evolving LLM Agents with In-Distribution Optimization

📅 2026-06-05
🤖 AI Summary
This work addresses credit assignment and training instability for large language model agents in sparse-reward environments that provide feedback only at the end of an episode. The authors propose Q-Evolve, a framework that builds a hybrid off-policy dataset from expert demonstrations and agent-generated trajectories and uses it to learn an in-distribution critic. Step-wise advantages estimated from this critic serve as dense process rewards, so no environment backtracking or human annotation is needed, and the policy can improve iteratively. The key idea is to unify automatic process-reward labeling and policy optimization within one in-distribution reinforcement learning paradigm that combines a weighted Implicit Q-Learning objective with behavior-proximal policy optimization. Because the policy is optimized over the same data used for process-reward labeling, supervision and policy co-evolve without exacerbating distribution shift. On AlfWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance.
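For orientation, the standard (unweighted) Implicit Q-Learning objectives that a critic of this kind typically builds on can be written as below. This is a sketch of the textbook formulation, not the paper's exact loss: the summary does not specify the weighting scheme, and the symbols here (expectile level \tau, discount \gamma, dataset \mathcal{D}, target parameters \hat{\theta}) are assumptions made for illustration.

```latex
\begin{align}
L_V(\psi) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ L_2^{\tau}\big( Q_{\hat{\theta}}(s,a) - V_{\psi}(s) \big) \right],
  \qquad L_2^{\tau}(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2 \\
L_Q(\theta) &= \mathbb{E}_{(s,a,s')\sim\mathcal{D}}
  \left[ \big( r(s,a) + \gamma V_{\psi}(s') - Q_{\theta}(s,a) \big)^2 \right] \\
A(s,a) &= Q_{\hat{\theta}}(s,a) - V_{\psi}(s)
\end{align}
```

Because both expectations are taken only over state-action pairs in \mathcal{D}, the backup never queries the critic on out-of-distribution actions. The advantage A(s,a) is the quantity that would plausibly serve as the step-wise process reward.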
📝 Abstract
Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
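To make the labeling and update steps concrete, here is a minimal sketch assuming the critic outputs for one trajectory are already available. The function names, the example numbers, and the use of a PPO-style clipped ratio against the behavior policy as a stand-in for "behavior-proximal" optimization are all illustrative assumptions; the abstract does not give the exact objective.

```python
import math


def process_rewards(q_values, v_values):
    """Return per-step advantages A_t = Q(s_t, a_t) - V(s_t) as dense rewards."""
    if len(q_values) != len(v_values):
        raise ValueError("q_values and v_values must have the same length")
    return [q - v for q, v in zip(q_values, v_values)]


def clipped_surrogate(logp_new, logp_behavior, advantages, eps=0.2):
    """PPO-style clipped objective with the ratio taken against the behavior policy.

    Assumption: this stands in for behavior-proximal optimization. The clip keeps
    the updated policy close to the policy that generated the labeled data.
    """
    if not (len(logp_new) == len(logp_behavior) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not advantages:
        raise ValueError("need at least one step")
    total = 0.0
    for lp_new, lp_beh, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_beh)  # pi_new(a|s) / pi_behavior(a|s)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Hypothetical critic outputs for a three-step trajectory.
    q = [0.5, 0.2, 0.9]
    v = [0.4, 0.4, 0.6]
    rewards = process_rewards(q, v)
    # Hypothetical action log-probabilities under the data-collecting policy.
    logp = [-1.0, -2.0, -0.5]
    print(rewards)
    print(clipped_surrogate(logp, logp, rewards))
```

In this sketch the second step receives a negative process reward because its Q-value falls below the state value, even though the environment itself would only report success or failure at the end of the episode.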
Problem

Research questions and friction points this paper is trying to address.

credit assignment
long-horizon decision making
sparse-reward environments
LLM agents
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-distribution reinforcement learning
self-evolving agents
process reward labeling
Implicit Q-Learning
distribution shift mitigation