AI Summary
Existing reinforcement learning algorithms struggle with credit assignment in multi-turn LLM agent tasks due to sparse, delayed rewards and coarse-grained optimization. This work proposes 3SPO, a novel algorithm that introduces, for the first time, a state-scoring supervision mechanism based on historical success rates to enable fine-grained credit assignment. 3SPO achieves this through progressive policy optimization, adaptive rollouts, and constrained supervised fine-tuning, without requiring value functions or auxiliary models. Theoretically, it exhibits logarithmic allocation regret and provides sample complexity guarantees for action identifiability, score distinguishability, and filtering stability. Empirically, 3SPO outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, achieving 2.4× more state exploration and 1.8× faster convergence under comparable computational budgets.
Abstract
Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose \textbf{State-Score-Supervised Policy Optimization (3SPO)}, a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by $+22.6\%$ on ALFWorld and $+15.6$ points on WebShop, while using comparable resources to achieve $2.4\times$ more state exploration and $1.8\times$ faster convergence. Code is available at https://github.com/genalyu/3SPO.
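The abstract does not give the formula for the state score, so the following is only an illustrative sketch of what a score "based on historical success rates" could look like. It assumes states can be reduced to hashable identifiers, that each episode ends in a binary success signal, and that a smoothed success frequency is an acceptable score. The class `StateScoreTable`, the function `step_credits`, and the smoothing parameters are hypothetical names introduced here, not taken from the 3SPO paper or its repository.

```python
class StateScoreTable:
    """Hypothetical table of smoothed per-state success rates."""

    def __init__(self, prior_success=1.0, prior_visits=2.0):
        # Assumed smoothing: an unseen state scores prior_success / prior_visits.
        self.prior_success = prior_success
        self.prior_visits = prior_visits
        self.visits = {}
        self.successes = {}

    def update(self, episode_states, succeeded):
        # Count each distinct state once per episode.
        for state in set(episode_states):
            self.visits[state] = self.visits.get(state, 0) + 1
            if succeeded:
                self.successes[state] = self.successes.get(state, 0) + 1

    def score(self, state):
        # Smoothed fraction of past episodes through `state` that succeeded.
        wins = self.successes.get(state, 0) + self.prior_success
        total = self.visits.get(state, 0) + self.prior_visits
        return wins / total


def step_credits(table, states):
    """Assumed per-step credit: change in state score across each transition."""
    scores = [table.score(s) for s in states]
    return [nxt - cur for cur, nxt in zip(scores, scores[1:])]


table = StateScoreTable()
table.update(["start", "kitchen", "goal"], succeeded=True)
table.update(["start", "kitchen", "dead_end"], succeeded=False)
credits = step_credits(table, ["start", "kitchen", "goal"])
```

In this toy run the transition into `goal` receives positive credit because that state has only appeared in a successful episode, while the transition from `start` to `kitchen` receives zero credit because both states have the same mixed history. How 3SPO actually turns scores into step-wise supervision, adaptive rollouts, and filtering is described in the paper itself rather than here.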