🤖 AI Summary
This paper addresses the lack of theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR) in post-training large language models, systematically analyzing its optimization dynamics at both the trajectory level and the token level. It introduces the concept of the "Gradient Gap" to quantitatively characterize the directional bias of policy updates under binary feedback, enabling a convergence theory that reveals a sharp phase transition between convergence and collapse, governed by a critical step-size threshold. The analysis formally justifies empirical practices such as length normalization and predicts how response length and success rate determine the critical step size. Bandit simulations and GRPO experiments on Qwen2.5-7B empirically validate the predicted stagnation behavior and convergence boundaries. This work establishes an interpretable theoretical framework for RLVR, bridging theory and practice in reward-guided LLM alignment.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success, yet a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$.  We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
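To make the bandit setting mentioned above concrete, here is a minimal sketch (our own illustration, not the authors' code) of a GRPO-style update on a $K$-armed bandit with binary reward: sample a group of responses from a softmax policy, subtract the group-mean reward as a baseline, and take a policy-gradient step on the logits. All function names, the group size, and the learning rate are hypothetical choices for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_bandit(num_arms=8, correct_arm=0, group_size=8,
                 lr=0.5, steps=300, seed=0):
    """GRPO-style policy gradient on a K-armed bandit with binary reward.

    Each step samples a group of arms from the current softmax policy,
    computes group-normalized advantages (reward minus the group mean),
    and updates the logits with a REINFORCE step. Returns the final
    probability assigned to the correct arm.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_arms
    for _ in range(steps):
        probs = softmax(logits)
        group = rng.choices(range(num_arms), weights=probs, k=group_size)
        rewards = [1.0 if a == correct_arm else 0.0 for a in group]
        mean_r = sum(rewards) / group_size
        for a, r in zip(group, rewards):
            adv = r - mean_r  # group baseline, as in GRPO
            # gradient of log softmax w.r.t. the logits: e_a - probs
            for i in range(num_arms):
                g = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * adv * g / group_size
    return softmax(logits)[correct_arm]
```

Note that when no sampled response in a group is correct, every reward equals the group mean, the advantage vanishes, and the update is exactly zero; at low success rates this happens often, giving a toy view of the stagnation behavior the paper analyzes.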