🤖 AI Summary
In RLHF methods such as PPO and GRPO, hard ratio clipping induces gradient discontinuities and information loss, forcing a compromise between training stability and policy fidelity. To address this, we propose Probability-Smoothed Policy Optimization (PSPO), which applies label-smoothing–style interpolation to the policy’s output probabilities prior to importance ratio computation, thereby constructing a soft trust region that replaces hard clipping. We theoretically establish that PSPO guarantees stable policy updates while preserving the full gradient signal. Implemented within the GRPO framework as GR-PSPO, our method significantly improves mathematical reasoning performance on the Qwen2.5 series: on GSM8K, accuracy rises to 39.7% (+22.1 points) for the 0.5B model and 59.4% (+21.6 points) for the 1.5B model. Moreover, generated outputs exhibit enhanced logical coherence and interpretability.
📝 Abstract
Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy's probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves the gradient signal, while interpolation toward the old policy creates a soft trust region, backed by formal guarantees, that discourages large, destabilising updates.
We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B and Qwen2.5-1.5B on GSM8K, evaluating on the GSM8K test set and on cross-dataset generalisation to SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, ratio always = 1), GR-PSPO achieves similar accuracy but improves the reasoning, yielding clearer, more concise, and more logical responses. Compared to clipped GRPO, GR-PSPO substantially improves performance for both the 0.5B and 1.5B models, with a gain of over 20 points on GSM8K (39.7% vs. 17.6% for 0.5B; 59.4% vs. 37.8% for 1.5B).
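The smoothing step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the smoothing coefficient `alpha` is a hypothetical hyperparameter name, and the code operates on scalar per-token probabilities for clarity.

```python
def pspo_ratio(p_new: float, p_old: float, alpha: float = 0.1) -> float:
    """Importance ratio with label-smoothing-style interpolation (PSPO sketch).

    Instead of hard-clipping r = p_new / p_old, blend the current policy's
    probability toward the old (behaviour) policy's before forming the ratio.
    Algebraically the result is (1 - alpha) * r + alpha: an affine pull toward
    1 that keeps the gradient w.r.t. p_new nonzero everywhere, i.e. a soft
    trust region rather than the flat (zero-gradient) region clipping creates.
    """
    p_smooth = (1 - alpha) * p_new + alpha * p_old
    return p_smooth / p_old


# Example: a token whose probability rose from 0.5 to 0.9 under the new policy.
plain_ratio = 0.9 / 0.5          # 1.8 -- would typically be clipped in PPO/GRPO
smoothed = pspo_ratio(0.9, 0.5)  # (0.9*0.9 + 0.1*0.5) / 0.5 = 1.72
```

Note that when the two policies agree (`p_new == p_old`), the smoothed ratio is exactly 1, so smoothing leaves on-policy updates unchanged and only tempers off-policy deviations.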