🤖 AI Summary
Existing activation intervention methods rely heavily on empirical design and lack theoretical grounding, making it difficult to systematically achieve efficient model adaptation. This work establishes, for the first time, a first-order equivalence between activation-space interventions and weight-space fine-tuning, revealing that the outputs of later transformer blocks serve as highly expressive intervention locations. Building on this insight, we propose a theoretically grounded strategy for selecting optimal intervention points. Furthermore, we introduce a novel weight-activation joint adaptation paradigm that simultaneously optimizes in both spaces. By training only 0.04% of the model parameters, our method achieves 99.1%–99.8% of the performance of full fine-tuning across multiple tasks, significantly outperforming mainstream parameter-efficient approaches such as ReFT and LoRA.
📝 Abstract
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%–0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
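The core equivalence can be illustrated with a minimal sketch (not the paper's code): for a linear map `h = W x`, a weight update `ΔW` changes the output by exactly `ΔW x`, so an additive activation-space shift `δ(x) = ΔW x` applied after the layer reproduces the fine-tuned output; for nonlinear blocks the same identity holds to first order. All names here are illustrative.

```python
import numpy as np

# Illustrative sketch of the first-order equivalence between a
# weight-space update dW and an activation-space intervention
# delta(x) = dW @ x applied at the layer output.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))       # base weights
dW = 0.01 * rng.standard_normal((4, 3))  # small fine-tuning update
x = rng.standard_normal(3)            # input activation

h_weight = (W + dW) @ x               # weight-space fine-tuning
h_steer = W @ x + dW @ x              # frozen weights + activation shift

# For a linear layer the two adaptations coincide exactly.
assert np.allclose(h_weight, h_steer)
```

For nonlinear transformer blocks the match is only first-order in `dW`, which is why the choice of intervention site matters: a post-block shift absorbs the block's full output perturbation in one place.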