🤖 AI Summary
This work investigates why RLVR (Reinforcement Learning with Verifiable Rewards) achieves significant improvements in large language model reasoning performance while modifying only a sparse subset of parameters. Method: We identify this sparsity as the surface signature of a model-conditioned optimization bias—parameter updates concentrate heavily within an intrinsic preference subspace of the pretrained model, with strong consistency across tasks, datasets, and algorithms. To formalize this, we propose the “Three-Gate Theory,” enabling the first white-box, parameter-level analysis of RLVR learning dynamics. Leveraging KL-constrained optimization, spectrum-preserving subspace modeling, and geometric analysis, we characterize its essential properties: update localization, off-principal-direction alignment, low curvature, and minimal spectral drift. Contribution/Results: We demonstrate that RLVR follows a fundamentally different optimization paradigm from supervised fine-tuning (SFT), rendering conventional PEFT methods ill-suited for RL settings. Our theoretical and geometric insights provide both foundational principles and practical guidance for designing robust, efficient RL-based fine-tuning algorithms.
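The KL-constrained optimization underlying Gate I can be pictured with a toy surrogate objective: a reward-weighted importance term minus a KL penalty that tethers the updated policy to the pretrained reference. This is a minimal sketch; the function name, the `beta` coefficient, and the per-token KL estimator are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl_anchored_objective(logp_new, logp_ref, advantages, beta=0.1):
    """Toy KL-anchored RL surrogate: reward term minus a KL penalty
    that keeps the updated policy close to the pretrained reference."""
    ratio = np.exp(logp_new - logp_ref)      # per-token importance ratio
    pg_term = np.mean(ratio * advantages)    # policy-gradient surrogate
    kl_term = np.mean(logp_new - logp_ref)   # sample KL estimate vs. reference
    return pg_term - beta * kl_term

# At initialization the policy equals the reference, so the KL penalty
# vanishes and the objective reduces to the mean advantage.
logp = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 0.25])
print(kl_anchored_objective(logp, logp, adv))  # 0.25, the mean advantage
```

Under such an anchor, any step that leaves the reference distribution is paid for in the KL term, which is what biases updates toward regions where the pretrained model is insensitive.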
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, remaining highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags behind RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
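The quantities the abstract appeals to (update sparsity, cross-run consistency, and alignment with principal directions of the weights) can all be measured directly on weight deltas. The sketch below, with synthetic data standing in for real checkpoints, shows one plausible way to compute them; the “preferred region” mask, tolerances, and subspace rank are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained weight matrix and two independent "RLVR" runs whose
# updates share a fixed preferred region (the model-conditioned bias).
W = rng.normal(size=(64, 64))
mask = rng.random(W.shape) < 0.05  # ~5% of coordinates are "preferred"
delta_run1 = mask * rng.normal(scale=1e-3, size=W.shape)
delta_run2 = mask * rng.normal(scale=1e-3, size=W.shape)

def sparsity(delta, tol=1e-8):
    """Fraction of parameters left numerically unchanged by the update."""
    return float(np.mean(np.abs(delta) < tol))

def overlap(d1, d2, tol=1e-8):
    """Jaccard overlap of the updated coordinate sets across two runs."""
    m1, m2 = np.abs(d1) > tol, np.abs(d2) > tol
    return float((m1 & m2).sum() / (m1 | m2).sum())

def principal_energy(W, delta, k=8):
    """Fraction of update energy inside the top-k left singular subspace of W.
    A value near 0 would indicate an off-principal update."""
    U, _, _ = np.linalg.svd(W)
    proj = U[:, :k] @ (U[:, :k].T @ delta)
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)

print(f"sparsity:          {sparsity(delta_run1):.2f}")
print(f"cross-run overlap: {overlap(delta_run1, delta_run2):.2f}")
print(f"top-8 energy:      {principal_energy(W, delta_run1):.2f}")
```

On real checkpoints one would replace the synthetic `W` and deltas with before/after weights of a layer; the abstract's claim is that RLVR yields high sparsity, high cross-run overlap, and low principal-subspace energy, while SFT shows the opposite pattern.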