Weight Updates as Activation Shifts: A Principled Framework for Steering

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing activation intervention methods rely heavily on empirical design and lack theoretical grounding, making it difficult to systematically achieve efficient model adaptation. This work establishes, for the first time, a first-order equivalence between activation-space interventions and weight-space fine-tuning, revealing that the outputs of later transformer blocks serve as highly expressive intervention locations. Building on this insight, we propose a theoretically grounded strategy for selecting optimal intervention points. Furthermore, we introduce a novel weight-activation joint adaptation paradigm that simultaneously optimizes in both spaces. By training only 0.04% of the model parameters, our method achieves 99.1%–99.8% of the performance of full fine-tuning across multiple tasks, significantly outperforming mainstream parameter-efficient approaches such as ReFT and LoRA.

📝 Abstract
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%–0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
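The core equivalence claimed in the abstract is easy to see in the linear case. The sketch below is our own illustration, not code from the paper: for a layer computing h = W x, a weight update ΔW changes the output by exactly ΔW x, so adding the steering vector δ = ΔW x to the activation reproduces the fine-tuned output; for nonlinear blocks the paper's equivalence holds only to first order in ΔW.

```python
import numpy as np

# Minimal numerical sketch (our illustration, not the paper's code) of the
# linear-case equivalence between weight updates and activation shifts:
# for h = W x, a weight update dW changes the output by exactly dW @ x,
# so steering with delta = dW @ x matches weight-space fine-tuning.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
dW = 0.01 * rng.standard_normal((d, d))   # a small "fine-tuning" update
x = rng.standard_normal(d)

h_finetuned = (W + dW) @ x        # weight-space adaptation
h_steered = W @ x + dW @ x        # activation steering with delta = dW @ x

print(np.allclose(h_finetuned, h_steered))  # → True (exact in the linear case)
```

For a nonlinear block f(x; W), the same argument goes through via a first-order Taylor expansion, which is why the correspondence is an equivalence only up to O(‖ΔW‖²) terms.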
Problem

Research questions and friction points this paper is trying to address.

activation steering
model adaptation
parameter efficiency
intervention location
weight updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering
weight updates
joint adaptation
parameter-efficient fine-tuning
post-block intervention
Authors

Dyah Adila, University of Wisconsin-Madison (Machine Learning)
John Cooper, Department of Computer Science, University of Wisconsin-Madison
Alexander Yun, Department of Computer Science, University of Wisconsin-Madison
Avi Trost, Department of Computer Science, University of Wisconsin-Madison
Frederic Sala, Assistant Professor, University of Wisconsin (Data-centric AI, Machine learning, Information theory)