🤖 AI Summary
This work investigates the universality and transferability of parameter update sparsity during reinforcement learning (RL) fine-tuning of large language models (LLMs). Across diverse RL algorithms (including PPO, DPO, SimPO, and PRIME) and multiple model families (OpenAI, Meta, and other open-source LLMs), RL fine-tuning consistently concentrates updates within a sparse subnetwork comprising only 5–30% of parameters, irrespective of algorithm, architecture, or downstream task; moreover, this subnetwork overlaps substantially across tasks and random seeds. The sparsity appears to arise because RL operates near the pretrained model's distribution and therefore requires only targeted changes; explicit mechanisms such as KL penalties and gradient clipping have limited effect on the pattern. Crucially, fine-tuning only the identified subnetwork while freezing all other parameters recovers the performance of full-parameter fine-tuning and yields parameters nearly identical to the fully fine-tuned model. This challenges the prevailing assumption that full-parameter updates are necessary for effective alignment, and points toward a transferable, computationally efficient sparse learning paradigm for LLM alignment.
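To make the central measurement concrete, here is a minimal sketch of how update sparsity can be computed by diffing two checkpoints, assuming PyTorch state dicts with matching keys; the function name `update_sparsity` and the tolerance `atol` are illustrative choices, not the paper's released code.

```python
import torch

def update_sparsity(base_state, tuned_state, atol=0.0):
    """Return the fraction of parameters changed by fine-tuning."""
    changed, total = 0, 0
    for name, base_w in base_state.items():
        tuned_w = tuned_state[name]
        moved = (base_w - tuned_w).abs() > atol  # True where a weight moved
        changed += moved.sum().item()
        total += moved.numel()
    return changed / total  # the paper reports roughly 0.05-0.30

# Usage sketch: compare a base model with its RL-fine-tuned counterpart.
# base = AutoModelForCausalLM.from_pretrained("base-model").state_dict()
# tuned = AutoModelForCausalLM.from_pretrained("rl-tuned-model").state_dict()
# print(f"updated fraction: {update_sparsity(base, tuned):.2%}")
```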
📝 Abstract
Reinforcement learning (RL) is a key post-pretraining step for aligning large language models (LLMs) with complex tasks and human preferences. While it is often assumed that RL fine-tuning requires updating most of a model's parameters, we challenge this assumption with a surprising finding: RL fine-tuning consistently modifies only a small subnetwork (typically 5–30% of weights), leaving most parameters unchanged. We call this phenomenon RL-induced parameter update sparsity. It arises naturally, without any sparsity constraints or parameter-efficient tuning, and appears across multiple RL algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI, Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show substantial overlap across different seeds, datasets, and algorithms, far exceeding chance, suggesting a partially transferable structure in the pretrained model. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model. Our analysis suggests this sparsity emerges because RL operates near the model's original distribution, requiring only targeted changes. KL penalties, gradient clipping, and on-policy dynamics have limited effect on the sparsity pattern. These findings shed new light on how RL adapts models: not by shifting all weights, but by focusing training on a small, consistently updated subnetwork. This insight enables more efficient RL methods and reframes sparsity through the lens of the lottery ticket hypothesis.
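As an illustration of the two analyses the abstract describes, the sketch below (PyTorch; all names are hypothetical and not the authors' code) shows how subnetwork overlap can be compared against a chance baseline, and how gradients can be masked so training touches only an identified subnetwork.

```python
import torch

def mask_overlap(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Fraction of the weights updated in run A that run B also updates."""
    return (mask_a & mask_b).sum().item() / max(mask_a.sum().item(), 1)

def chance_overlap(mask_b: torch.Tensor) -> float:
    """Expected overlap if run B's mask were random at the same density."""
    return mask_b.float().mean().item()

def freeze_outside_subnetwork(model, masks):
    """Zero out gradients outside the subnetwork during backward().

    masks: dict of parameter name -> boolean tensor (True = trainable).
    Use weight_decay=0 in the optimizer so decoupled weight decay does
    not move the frozen weights.
    """
    for name, param in model.named_parameters():
        m = masks[name].to(param.device, dtype=param.dtype)
        param.register_hook(lambda grad, m=m: grad * m)
```

With masks extracted from a base/tuned checkpoint diff, optimizer steps then move only the sparse subnetwork, which is the setup under which the paper reports full recovery of fine-tuned performance.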