Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Agentic RL for tool-augmented LLMs faces two key challenges in long-horizon reasoning: (1) sparse binary rewards provide insufficient guidance for intermediate steps, and (2) intra-group reward homogeneity in GRPO collapses advantages to zero, harming sample efficiency and training stability. To address these, we propose two complementary techniques: Progressive Reward Shaping (PRS), a phased reward design that separately guides tool invocation and answer generation, and Value-based Sampling Policy Optimization (VSPO), which combines value-estimated dynamic sampling with advantage smoothing to mitigate intra-group reward degeneracy. Combined with LLM-as-a-Judge evaluation and a length-aware BLEU, these methods achieve significant improvements over PPO and GRPO across diverse question-answering benchmarks, accelerating convergence by over 35% while simultaneously enhancing generalization and training stability.

📝 Abstract
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
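As a rough illustration of the stage-wise reward idea from the abstract, the sketch below gates an answer-quality term behind format checks on the tool call, so the model is first rewarded for parseable, well-formatted calls and only then for answer quality. The stage weights (0.2/0.2/0.6) and the simplified length-aware BLEU are assumptions for illustration, not the paper's exact design.

```python
import math

def length_aware_bleu(pred_tokens, ref_tokens, max_n=2):
    """Toy n-gram precision with a length factor so concise gold answers
    are not unfairly penalized against verbose predictions (assumed
    simplification of the paper's length-aware BLEU)."""
    if not pred_tokens or not ref_tokens:
        return 0.0
    effective_n = min(max_n, len(pred_tokens), len(ref_tokens))
    precisions = []
    for n in range(1, effective_n + 1):
        pred_ngrams = [tuple(pred_tokens[i:i + n])
                       for i in range(len(pred_tokens) - n + 1)]
        ref_counts = {}
        for i in range(len(ref_tokens) - n + 1):
            g = tuple(ref_tokens[i:i + n])
            ref_counts[g] = ref_counts.get(g, 0) + 1
        matches = 0
        for g in pred_ngrams:
            if ref_counts.get(g, 0) > 0:   # clipped n-gram matching
                ref_counts[g] -= 1
                matches += 1
        precisions.append(matches / len(pred_ngrams))
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / effective_n)
    # Length factor: only penalize predictions longer than the reference.
    length_factor = min(1.0, len(ref_tokens) / len(pred_tokens))
    return geo_mean * length_factor

def prs_reward(call_parses, call_well_formed, pred_tokens, ref_tokens):
    """Stage-wise dense reward: first reward a parseable tool call, then a
    well-formatted one, then answer quality on top of both."""
    reward = 0.0
    if call_parses:
        reward += 0.2                      # stage 1: parseable tool call
        if call_well_formed:
            reward += 0.2                  # stage 2: correct call format
            reward += 0.6 * length_aware_bleu(pred_tokens, ref_tokens)
    return reward
```

Gating the answer-quality term behind the format stages mirrors the curriculum intent: early in training most reward comes from producing valid tool calls, and the answer term dominates only once those are mastered.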
Problem

Research questions and friction points this paper is trying to address.

Sparse rewards hinder guidance in Agentic RL training
Gradient degradation reduces sample efficiency in policy optimization
Agentic RL struggles with complex long-horizon reasoning tasks
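The gradient-degradation point above can be seen directly in how group-relative advantages are computed: GRPO standardizes rewards within each rollout group, so a group whose rollouts all receive the same reward contributes zero advantage for every member. A toy sketch, assuming the standard GRPO normalization (the `eps` term is a numerical-safety assumption):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within one rollout
    group (mean-centered, divided by the group's reward std)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

mixed = grpo_advantages([1.0, 0.0, 1.0, 0.0])       # informative group
degenerate = grpo_advantages([1.0, 1.0, 1.0, 1.0])  # identical rewards
```

With binary rewards, any prompt the policy answers always correctly (or always incorrectly) produces such a degenerate group, wasting the entire rollout budget for that prompt.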
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Reward Shaping introduces dense, stage-wise feedback for tool calls.
Value-based Sampling Policy Optimization selects prompts balancing difficulty and uncertainty.
Together, these techniques yield tool-integrated agents that generalize better across domains.
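The abstract describes VSPO's sampling only as replacing low-value samples via "a task-value metric balancing difficulty and uncertainty". A minimal sketch of that balancing idea, assuming the difficulty term is one minus the empirical pass rate, the uncertainty term is the binary entropy of that pass rate, and `lam` weights the two; all three choices are illustrative assumptions, not the paper's formula:

```python
import math

def task_value(pass_rate, lam=0.5):
    """Hypothetical task-value metric: difficulty (1 - pass_rate) blended
    with uncertainty (binary entropy of pass_rate)."""
    p = min(max(pass_rate, 1e-6), 1.0 - 1e-6)  # clamp away from 0 and 1
    difficulty = 1.0 - p
    uncertainty = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    return (1.0 - lam) * difficulty + lam * uncertainty

def select_prompts(pool, k, lam=0.5):
    """Keep the k prompts with the highest task value; low-value prompts
    (e.g. those the policy already solves reliably) are dropped.
    `pool` maps prompt -> empirical pass rate from recent rollouts."""
    ranked = sorted(pool, key=lambda q: task_value(pool[q], lam), reverse=True)
    return ranked[:k]

# Hypothetical pool: an easy, a hard, and a borderline prompt.
pool = {"q_easy": 0.95, "q_hard": 0.05, "q_border": 0.5}
```

Under this metric a borderline prompt (pass rate near 0.5) scores highest, since it is both moderately difficult and maximally uncertain, which is exactly the kind of prompt that avoids the degenerate all-identical reward groups GRPO wastes.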
Zhuoran Zhuang
Fliggy Alibaba, Zhejiang, Hangzhou
Ye Chen
Fliggy Alibaba, Zhejiang, Hangzhou
Jianghao Su
Fliggy Alibaba, Zhejiang, Hangzhou
Chao Luo
Shijiazhuang Tiedao University
Luhui Liu
Fliggy Alibaba, Zhejiang, Hangzhou
Xia Zeng
Fliggy Alibaba, Zhejiang, Hangzhou