🤖 AI Summary
This work investigates why RLVR (Reinforcement Learning with Verifiable Rewards) achieves significant improvements in large language model reasoning performance while modifying only a sparse subset of parameters. Method: We identify this sparsity as the surface signature of a model-conditioned optimization bias—parameter updates concentrate heavily within an intrinsic preference subspace of the pretrained model, with strong consistency across tasks, datasets, and algorithms. To formalize this, we propose the “Three-Gate Theory,” enabling the first white-box, parameter-level analysis of RLVR learning dynamics. Leveraging KL-constrained optimization, spectrum-preserving subspace modeling, and geometric analysis, we characterize its essential properties: update localization, off-principal-direction alignment, low curvature, and minimal spectral drift. Contribution/Results: We demonstrate that RLVR follows a fundamentally different optimization paradigm from supervised fine-tuning (SFT), rendering conventional PEFT methods ill-suited for RL settings. Our theoretical and geometric insights provide both foundational principles and practical guidance for designing robust, efficient RL-based fine-tuning algorithms.
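The KL-constrained optimization underlying Gate I can be pictured with a toy surrogate objective: a reward-weighted importance term minus a KL penalty that tethers the updated policy to the pretrained reference. This is a minimal sketch; the function name, the `beta` coefficient, and the per-token KL estimator are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl_anchored_objective(logp_new, logp_ref, advantages, beta=0.1):
    """Toy KL-anchored RL surrogate: reward term minus a KL penalty
    that keeps the updated policy close to the pretrained reference."""
    ratio = np.exp(logp_new - logp_ref)      # per-token importance ratio
    pg_term = np.mean(ratio * advantages)    # policy-gradient surrogate
    kl_term = np.mean(logp_new - logp_ref)   # sample KL estimate vs. reference
    return pg_term - beta * kl_term

# At initialization the policy equals the reference, so the KL penalty
# vanishes and the objective reduces to the mean advantage.
logp = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 0.25])
print(kl_anchored_objective(logp, logp, adv))  # 0.25, the mean advantage
```

Under such an anchor, any step that leaves the reference distribution is paid for in the KL term, which is what biases updates toward regions where the pretrained model is insensitive.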
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, remaining highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags behind RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
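The quantities the abstract appeals to (update sparsity, cross-run consistency, and alignment with principal directions of the weights) can all be measured directly on weight deltas. The sketch below, with synthetic data standing in for real checkpoints, shows one plausible way to compute them; the “preferred region” mask, tolerances, and subspace rank are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained weight matrix and two independent "RLVR" runs whose
# updates share a fixed preferred region (the model-conditioned bias).
W = rng.normal(size=(64, 64))
mask = rng.random(W.shape) < 0.05  # ~5% of coordinates are "preferred"
delta_run1 = mask * rng.normal(scale=1e-3, size=W.shape)
delta_run2 = mask * rng.normal(scale=1e-3, size=W.shape)

def sparsity(delta, tol=1e-8):
    """Fraction of parameters left numerically unchanged by the update."""
    return float(np.mean(np.abs(delta) < tol))

def overlap(d1, d2, tol=1e-8):
    """Jaccard overlap of the updated coordinate sets across two runs."""
    m1, m2 = np.abs(d1) > tol, np.abs(d2) > tol
    return float((m1 & m2).sum() / (m1 | m2).sum())

def principal_energy(W, delta, k=8):
    """Fraction of update energy inside the top-k left singular subspace of W.
    A value near 0 would indicate an off-principal update."""
    U, _, _ = np.linalg.svd(W)
    proj = U[:, :k] @ (U[:, :k].T @ delta)
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)

print(f"sparsity:          {sparsity(delta_run1):.2f}")
print(f"cross-run overlap: {overlap(delta_run1, delta_run2):.2f}")
print(f"top-8 energy:      {principal_energy(W, delta_run1):.2f}")
```

On real checkpoints one would replace the synthetic `W` and deltas with before/after weights of a layer; the abstract's claim is that RLVR yields high sparsity, high cross-run overlap, and low principal-subspace energy, while SFT shows the opposite pattern.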