When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

📅 2026-06-04

🤖 AI Summary

This work addresses credit assignment in long-horizon language agent tasks with sparse and delayed rewards. Existing dense-credit methods can become unstable because, under limited rollouts, rare but lucky actions receive overly large advantages. To mitigate this, the paper proposes Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that statistically calibrates step-level credit before policy updates. ECPO groups rollouts by action and shrinks low-count advantage estimates, and it applies variance-aware anchor weighting to suppress anchor states dominated by within-action noise. Together these reduce the bias introduced by low-frequency actions and the late-stage training oscillation it causes. Evaluated on ALFWorld and WebShop with Qwen2.5-1.5B, ECPO improves on GiGPO by 5.2 and 7.3 success-rate points, respectively, while adding only 0.1% advantage-computation overhead.

📝 Abstract

Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
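The two calibration steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give ECPO's formulas, so everything concrete here is an assumption made for illustration: the function name `calibrated_advantages`, the `n / (n + prior_strength)` shrinkage factor, and the gate defined as the share of return variance explained by the action choice. It shows one plausible way to shrink low-count action estimates and to down-weight noisy anchor states, not the authors' implementation.

```python
from collections import defaultdict
from statistics import mean, pvariance


def calibrated_advantages(rollouts, prior_strength=2.0):
    """Illustrative credit calibration at a single anchor state.

    rollouts: list of (action, return) pairs that share one anchor state.
    prior_strength: hypothetical pseudo-count controlling the shrinkage.
    Returns (per-action advantages, anchor gate in [0, 1]).
    """
    # Group returns by (canonical) action.
    by_action = defaultdict(list)
    for action, ret in rollouts:
        by_action[action].append(ret)

    returns = [ret for _, ret in rollouts]
    baseline = mean(returns)

    # Assumed shrinkage: pull low-count estimates toward zero advantage.
    shrunk = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = mean(rets) - baseline
        shrunk[action] = (n / (n + prior_strength)) * raw

    # Assumed gate: fraction of return variance NOT due to within-action noise.
    total_var = pvariance(returns)
    within_var = sum(len(r) * pvariance(r) for r in by_action.values()) / len(returns)
    if total_var == 0:
        gate = 0.0
    else:
        gate = min(1.0, max(0.0, 1.0 - within_var / total_var))

    return {a: gate * adv for a, adv in shrunk.items()}, gate
```

In this sketch, an action observed once with a lucky return keeps only a third of its raw advantage (with `prior_strength=2.0`), and an anchor state whose returns vary mostly within the same action gets a gate near zero, so it contributes little to the update.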

Problem

Research questions and friction points this paper is trying to address.

credit assignment

long-horizon reinforcement learning

sparse rewards

policy optimization

LLM agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evidence-Calibrated Policy Optimization

credit assignment

long-horizon LLM agents