Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

📅 2025-12-18

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

When applying Generalized Reinforcement Policy Optimization (GRPO) to multi-turn interactive LLM agents for long-horizon reasoning tasks, instability in advantage estimation and policy degradation arise due to token-level optimization misaligned with the hierarchical structure of dialogue. Method: We propose a turn-level Markov Decision Process (MDP) modeling framework, elevating policy optimization granularity from tokens to dialogue turns. Building upon this, we design an enhanced GRPO algorithm integrating turn-level reward attribution, long-term credit assignment, and stabilized advantage estimation. Results: On WebShop and Sokoban benchmarks, our method significantly outperforms standard GRPO—improving success rates on long-reasoning tasks by over 18%, reducing training variance by 32%, and yielding more robust policy convergence. Our core contribution is the first reinforcement learning formulation that enables turn-level modeling and optimization for multi-turn interactive agents, establishing a new paradigm for stable and efficient LLM agent training.

Technology Category

Application Category

📝 Abstract

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.

Problem

Research questions and friction points this paper is trying to address.

Improves multi-turn RL for agentic LLMs

Addresses limitations of GRPO in long-horizon tasks

Introduces turn-level MDP for stable advantage estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Turn-level MDP formulation for multi-turn RL

Proximal Policy Optimization as robust alternative

Enhanced advantage estimation in agentic LLMs

🔎 Similar Papers

No similar papers found.