Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing sparse autoencoder (SAE)-based model editing methods project task vectors into the SAE feature subspace. This work shows that the projection discards approximately 97% of the modification energy because of a geometric misalignment between activation-space SAE directions and weight-space task vectors, which severely limits gains in mathematical reasoning. The authors propose using the SAE as a “stethoscope” rather than a “scalpel”: an SAE-derived layer specificity score identifies the layers to intervene on, and the unfiltered original task vectors are injected directly at those layers, bypassing the information bottleneck. The approach departs from the usual feature-level SAE interventions and yields interpretable, deterministic edits with no additional inference cost. On the Minerva Math benchmark, it improves Number Theory accuracy from 29.6% to 39.4% (z = +3.41, p = 0.0007); five of seven mathematical subjects show statistically significant gains and none shows a significant decline.
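To make "modification energy" concrete, the following is a minimal sketch of how the fraction of a vector's squared norm that survives projection onto a subspace could be measured. It is a toy illustration, not the paper's measurement: the function names (`retained_energy`, `orthonormalize`), the flattened-vector representation, and the random data are all assumptions introduced here.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def retained_energy(task_vector, directions):
    """Fraction of ||task_vector||^2 kept after projecting onto span(directions)."""
    basis = orthonormalize(directions)
    projected_sq = sum(dot(task_vector, b) ** 2 for b in basis)
    return projected_sq / dot(task_vector, task_vector)


# Hypothetical data: a random 300-dimensional "task vector" and 9 random
# "feature directions". None of this comes from the paper.
rng = random.Random(0)
d, k = 300, 9
tau = [rng.gauss(0.0, 1.0) for _ in range(d)]
features = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
print(f"retained energy fraction: {retained_energy(tau, features):.3f}")
```

For an isotropic random vector, a k-dimensional subspace of a d-dimensional space retains about k/d of the energy on average, so a low-dimensional subspace that is not aligned with the task vector keeps very little of it. The paper's 97% figure is an empirical result on Gemma-3-4B-IT and is not reproduced by this toy example.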

📝 Abstract

LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
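The layer-selective injection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the specificity score is computed or how layers are chosen from it, so the top-k selection rule, the `scale` parameter, the per-layer flat weight lists, and all names (`select_layers`, `apply_raw_task_vectors`) are hypothetical. The task vector is taken to be the usual difference between fine-tuned and base weights.

```python
def select_layers(specificity, top_k):
    """Pick the top_k layers by score (assumed selection rule, not from the paper)."""
    ranked = sorted(specificity, key=specificity.get, reverse=True)
    return sorted(ranked[:top_k])


def apply_raw_task_vectors(base, finetuned, layers, scale=1.0):
    """Add the unfiltered task vector (finetuned - base) only at selected layers.

    `base` and `finetuned` map layer index -> flat list of weights.
    Layers outside `layers` are copied unchanged.
    """
    selected = set(layers)
    edited = {}
    for layer, weights in base.items():
        if layer in selected:
            edited[layer] = [
                b + scale * (f - b) for b, f in zip(weights, finetuned[layer])
            ]
        else:
            edited[layer] = list(weights)
    return edited


# Hypothetical toy model with three layers and made-up specificity scores.
base = {0: [1.0, 2.0], 1: [0.0, 0.0], 2: [5.0, 5.0]}
finetuned = {0: [2.0, 2.0], 1: [1.0, -1.0], 2: [5.0, 7.0]}
specificity = {0: 0.1, 1: 0.9, 2: 0.5}

layers = select_layers(specificity, top_k=2)
edited = apply_raw_task_vectors(base, finetuned, layers)
print(layers, edited)
```

In this sketch the edit is a one-time change to the stored weights, which is one way an edit can add no work at inference time; the SAE contributes only the scores that decide where the edit is applied, never a filter on what is applied.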

Problem

Research questions and friction points this paper is trying to address.

model editing

Sparse Autoencoders

task vectors

interpretability

mathematical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders

Model Editing

Task Vectors