🤖 AI Summary
This paper addresses the lack of theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR) in post-training large language models, systematically analyzing its optimization dynamics at both the trajectory level and the token level. It introduces the concept of the "Gradient Gap" to quantitatively characterize the directional bias of policy updates under binary feedback, enabling a convergence theory that reveals a sharp phase transition between convergence and collapse, governed by a critical step-size threshold. The analysis formally justifies empirical practices such as length normalization and predicts how response length and success rate determine the critical step size. Bandit simulations and GRPO experiments on Qwen2.5-7B empirically validate the predicted stagnation behavior and convergence boundaries. This work establishes an interpretable theoretical framework for RLVR, bridging theory and practice in reward-guided LLM alignment.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success, yet a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$.  We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
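To make the bandit setting mentioned above concrete, here is a minimal sketch (our own illustration, not the authors' code) of a GRPO-style update on a $K$-armed bandit with binary reward: sample a group of responses from a softmax policy, subtract the group-mean reward as a baseline, and take a policy-gradient step on the logits. All function names, the group size, and the learning rate are hypothetical choices for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_bandit(num_arms=8, correct_arm=0, group_size=8,
                 lr=0.5, steps=300, seed=0):
    """GRPO-style policy gradient on a K-armed bandit with binary reward.

    Each step samples a group of arms from the current softmax policy,
    computes group-normalized advantages (reward minus the group mean),
    and updates the logits with a REINFORCE step. Returns the final
    probability assigned to the correct arm.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_arms
    for _ in range(steps):
        probs = softmax(logits)
        group = rng.choices(range(num_arms), weights=probs, k=group_size)
        rewards = [1.0 if a == correct_arm else 0.0 for a in group]
        mean_r = sum(rewards) / group_size
        for a, r in zip(group, rewards):
            adv = r - mean_r  # group baseline, as in GRPO
            # gradient of log softmax w.r.t. the logits: e_a - probs
            for i in range(num_arms):
                g = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * adv * g / group_size
    return softmax(logits)[correct_arm]
```

Note that when no sampled response in a group is correct, every reward equals the group mean, the advantage vanishes, and the update is exactly zero; at low success rates this happens often, giving a toy view of the stagnation behavior the paper analyzes.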