🤖 AI Summary
This work addresses fundamental challenges in training large language models (LLMs) as interactive agents via multi-turn reinforcement learning (RL): long-horizon decision instability, training divergence under stochastic environment feedback, and the failure of reasoning capabilities to emerge. We propose StarPO, a trajectory-level policy optimization framework, and its stabilized variant, StarPO-S, which combines critic incorporation, uncertainty-based trajectory filtering, and decoupled clipping. We identify and mitigate the "Echo Trap" failure mode inherent in multi-turn RL; systematically validate the critical roles of initial-state diversity, interaction granularity, and sampling frequency in rollout quality; and establish that fine-grained, reasoning-aware rewards are essential for chain-of-thought emergence. Empirically, StarPO-S markedly improves training stability across three simulated environments and yields measurable reasoning emergence. We open-source the RAGEN system, including benchmark environments, evaluation protocols, and implementation code, to advance standardization in RL-based LLM agent training.
📝 Abstract
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode we call the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, RL rollout quality benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
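Two of the stabilization ingredients named above, trajectory filtering and decoupled clipping, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the uncertainty proxy (per-prompt reward standard deviation), the keep fraction, and the clip bounds `eps_low`/`eps_high` are all hypothetical; "decoupled clipping" is rendered here as asymmetric lower/upper clip ranges on the importance ratio.

```python
import numpy as np

def starpo_s_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with decoupled (asymmetric) clip bounds.

    Unlike vanilla PPO, the lower and upper clip ranges are set
    independently, allowing a looser upper bound for positive updates.
    """
    ratio = np.exp(logp_new - logp_old)          # importance ratio per token
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic surrogate: take the smaller of the two objectives.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

def filter_trajectories(groups, keep_frac=0.25):
    """Uncertainty-based trajectory filtering.

    `groups` is a list of (rewards, payload) pairs, one per prompt.
    Keep only the fraction of groups with the highest reward standard
    deviation, on the assumption that high-variance prompts carry the
    most useful learning signal.
    """
    stds = [np.std(rewards) for rewards, _ in groups]
    k = max(1, int(len(groups) * keep_frac))
    keep_idx = np.argsort(stds)[::-1][:k]        # highest-variance first
    return [groups[i] for i in keep_idx]
```

For example, given two prompt groups, one with identical rollout rewards and one with spread-out rewards, `filter_trajectories(..., keep_frac=0.5)` retains only the high-variance group.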