3SPO: State-Score-Supervised Policy Optimization for LLM Agents

📅 2026-06-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing reinforcement learning algorithms struggle with credit assignment in multi-turn LLM agent tasks due to sparse, delayed rewards and coarse-grained optimization. This work proposes 3SPO, a novel algorithm that introduces, for the first time, a state-scoring supervision mechanism based on historical success rates to enable fine-grained credit assignment. 3SPO achieves this through progressive policy optimization, adaptive rollouts, and constrained supervised fine-tuning, without requiring value functions or auxiliary models. Theoretically, it exhibits logarithmic allocation regret and provides sample complexity guarantees for action identifiability, score distinguishability, and filtering stability. Empirically, 3SPO outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, achieving 2.4× more state exploration and 1.8× faster convergence under comparable computational budgets.
📝 Abstract
Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout, and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4× more state exploration and 1.8× faster convergence. Code is available at https://github.com/genalyu/3SPO.
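The abstract says the state score is computed from historical success rates but does not give the formula. As a minimal sketch of that idea, the Python below keeps per-state visit and success counts and turns score differences between consecutive states into a step-level credit signal. The class `StateScoreTable`, the function `step_credits`, the `prior` default for unseen states, and the difference-based credit rule are all hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict


class StateScoreTable:
    """Tracks, per state, the fraction of past trajectories through it that succeeded."""

    def __init__(self, prior: float = 0.5):
        # Score returned for states with no history (an assumption of this sketch).
        self.prior = prior
        self.visits = defaultdict(int)
        self.successes = defaultdict(int)

    def update(self, trajectory_states, success: bool) -> None:
        # Count each distinct state at most once per trajectory.
        for state in set(trajectory_states):
            self.visits[state] += 1
            if success:
                self.successes[state] += 1

    def score(self, state) -> float:
        n = self.visits.get(state, 0)
        if n == 0:
            return self.prior
        return self.successes[state] / n


def step_credits(table, trajectory_states):
    """Hypothetical step-level credit: change in state score across each transition."""
    scores = [table.score(s) for s in trajectory_states]
    return [nxt - cur for cur, nxt in zip(scores, scores[1:])]


# Toy history: three finished episodes with known outcomes.
table = StateScoreTable()
table.update(["start", "kitchen", "goal"], success=True)
table.update(["start", "kitchen", "hallway"], success=False)
table.update(["start", "bedroom"], success=False)

# One credit value per transition in a new trajectory.
credits = step_credits(table, ["start", "kitchen", "goal"])
```

In this toy history, a step that moves the agent into a state with a higher historical success rate receives positive credit, which gives a per-step signal without a learned value function. How 3SPO actually uses the score for adaptive rollouts and post-step updates is described in the paper and its repository.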
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
credit assignment
sparse rewards
large language models
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

State-Score-Supervised Policy Optimization
step-wise credit assignment
post-step policy optimization
reinforcement learning for LLM agents
state score supervision