When Attribution Patching Lies: Diagnosis and a Second-Order Correction

📅 2026-06-05

🤖 AI Summary

Attribution patching, a gradient-based first-order approximation to activation patching in large language models, can produce systematically wrong importance estimates, which hinders reliable identification of causal mechanisms. This work shows that the dominant error stems from nonlinearities in the downstream network rather than local curvature at the patched component, and introduces a second-order correction based on Hessian-vector products (HVPs) that removes the leading-order error with only one additional backward pass. The authors also propose a reliability score, error bounds, and a Screen-Flag-Fix workflow, and validate the approach across five model families ranging from 124M to 9B parameters. A multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, and the corrections yield higher-fidelity circuit recovery on standard benchmarks.

📝 Abstract

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.
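To make the first-order versus second-order distinction concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation. It assumes the generic second-order Taylor expansion of a metric m around the clean activation: the true patching effect m(a_patch) - m(a_clean) is approximated by g·d (attribution patching) and then by g·d + 0.5 d·(H d), where d = a_patch - a_clean and H d is a Hessian-vector product. The toy "downstream network" (a single tanh unit), the function names, and the finite-difference HVP are all illustrative assumptions; a real model would obtain H d from automatic differentiation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def metric(a, w):
    """Hypothetical downstream network: one tanh unit reading activation a."""
    return math.tanh(dot(w, a))

def grad_metric(a, w):
    """Analytic gradient of metric w.r.t. a (stands in for a backward pass)."""
    t = metric(a, w)
    return [(1.0 - t * t) * wi for wi in w]

def patching_estimates(a_clean, a_patch, w, eps=1e-4):
    """Return (exact effect, first-order estimate, second-order estimate)."""
    d = [p - c for p, c in zip(a_patch, a_clean)]

    # Ground truth: what activation patching would measure directly.
    exact = metric(a_patch, w) - metric(a_clean, w)

    # Attribution patching: first-order Taylor term g . d.
    g = grad_metric(a_clean, w)
    first_order = dot(g, d)

    # Assumed HVP approximation: H d ~ (g(a + eps*d) - g(a)) / eps,
    # i.e. one extra gradient evaluation at a slightly shifted point.
    a_shift = [c + eps * di for c, di in zip(a_clean, d)]
    g_shift = grad_metric(a_shift, w)
    hvp = [(gs - g0) / eps for gs, g0 in zip(g_shift, g)]

    # Second-order corrected estimate: g . d + 0.5 * d . (H d).
    second_order = first_order + 0.5 * dot(d, hvp)
    return exact, first_order, second_order

if __name__ == "__main__":
    exact, first, second = patching_estimates(
        a_clean=[0.2, 0.1], a_patch=[0.6, -0.1], w=[1.0, -0.5]
    )
    print("exact effect:        ", exact)
    print("first-order error:   ", abs(first - exact))
    print("second-order error:  ", abs(second - exact))
```

In this toy setting the patched component itself is linear in a, so all of the first-order error comes from the tanh that sits downstream of it, which loosely mirrors the abstract's claim about where the error originates. The size of the quadratic term relative to the linear term is one plausible signal for flagging unreliable estimates, though the paper's actual reliability score is not specified here.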

Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability

attribution patching

causal attribution

systematic error

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

attribution patching

Hessian-vector product

mechanistic interpretability