๐ค AI Summary
When applying Generalized Reinforcement Policy Optimization (GRPO) to multi-turn interactive LLM agents for long-horizon reasoning tasks, instability in advantage estimation and policy degradation arise due to token-level optimization misaligned with the hierarchical structure of dialogue. Method: We propose a turn-level Markov Decision Process (MDP) modeling framework, elevating policy optimization granularity from tokens to dialogue turns. Building upon this, we design an enhanced GRPO algorithm integrating turn-level reward attribution, long-term credit assignment, and stabilized advantage estimation. Results: On WebShop and Sokoban benchmarks, our method significantly outperforms standard GRPOโimproving success rates on long-reasoning tasks by over 18%, reducing training variance by 32%, and yielding more robust policy convergence. Our core contribution is the first reinforcement learning formulation that enables turn-level modeling and optimization for multi-turn interactive agents, establishing a new paradigm for stable and efficient LLM agent training.
๐ Abstract
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.