🤖 AI Summary
This work addresses fundamental challenges in training large language models (LLMs) as interactive agents via multi-turn reinforcement learning (RL): long-horizon decision instability, training divergence under stochastic environment feedback, and the failure of reasoning capabilities to emerge. We propose StarPO, a trajectory-level policy optimization framework, and its stabilized variant, StarPO-S, which combines critic incorporation, uncertainty-based trajectory filtering, and decoupled clipping. We identify and mitigate the "Echo Trap" failure mode inherent in multi-turn RL; systematically validate the critical roles of initial-state diversity, interaction granularity, and sampling frequency in rollout quality; and establish that fine-grained, reasoning-aware rewards are essential for chain-of-thought emergence. Empirically, StarPO-S markedly improves training stability across three simulated environments and yields measurable reasoning emergence. We open-source the RAGEN system, including benchmark environments, evaluation protocols, and implementation code, to advance standardization in RL-based LLM agent training.
📝 Abstract
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode we call the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, RL rollout quality benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
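Two of the stabilization ingredients named above, trajectory filtering and decoupled clipping, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the uncertainty proxy (per-prompt reward standard deviation), the keep fraction, and the clip bounds `eps_low`/`eps_high` are all hypothetical; "decoupled clipping" is rendered here as asymmetric lower/upper clip ranges on the importance ratio.

```python
import numpy as np

def starpo_s_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with decoupled (asymmetric) clip bounds.

    Unlike vanilla PPO, the lower and upper clip ranges are set
    independently, allowing a looser upper bound for positive updates.
    """
    ratio = np.exp(logp_new - logp_old)          # importance ratio per token
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic surrogate: take the smaller of the two objectives.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

def filter_trajectories(groups, keep_frac=0.25):
    """Uncertainty-based trajectory filtering.

    `groups` is a list of (rewards, payload) pairs, one per prompt.
    Keep only the fraction of groups with the highest reward standard
    deviation, on the assumption that high-variance prompts carry the
    most useful learning signal.
    """
    stds = [np.std(rewards) for rewards, _ in groups]
    k = max(1, int(len(groups) * keep_frac))
    keep_idx = np.argsort(stds)[::-1][:k]        # highest-variance first
    return [groups[i] for i in keep_idx]
```

For example, given two prompt groups, one with identical rollout rewards and one with spread-out rewards, `filter_trajectories(..., keep_frac=0.5)` retains only the high-variance group.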