RelP: Faithful and Efficient Circuit Discovery via Relevance Patching

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Activation patching, a standard technique in mechanistic interpretability for localizing the components responsible for a behavior, is computationally expensive. Attribution patching is a faster, gradient-based approximation, but it is noisy and less reliable in deep, highly non-linear models. To address this trade-off, we propose Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). Like attribution patching, RelP requires only two forward passes and one backward pass, so it keeps the computational efficiency while improving faithfulness. For MLP outputs in GPT-2 Large, RelP reaches a Pearson correlation of 0.956 with activation patching, compared with 0.006 for attribution patching. For sparse feature circuits, RelP achieves faithfulness comparable to Integrated Gradients (IG) without IG's extra computational cost.

📝 Abstract

Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network's output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
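The cost difference between the two baselines described above can be made concrete with a toy example. The sketch below is not the paper's code: the two-layer network, its weights, the `tanh` metric, and the choice to patch corrupted activations into the clean run are all assumptions made for illustration. It computes exact activation-patching effects (one extra forward pass per component) and the first-order attribution-patching estimate (two forward passes plus one gradient), then compares them with a Pearson correlation, the agreement measure the abstract reports.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # hypothetical input -> hidden weights
w2 = rng.normal(size=8)        # hypothetical hidden -> scalar metric weights

def hidden(x):
    """Hidden activations; these are the components we patch."""
    return W1 @ x

def metric(h):
    """Scalar behavior metric computed from the hidden activations."""
    return float(w2 @ np.tanh(h))

def metric_grad(h):
    """Analytic d metric / d h; stands in for a single backward pass."""
    return w2 * (1.0 - np.tanh(h) ** 2)

def activation_patching(h_clean, h_corr):
    """Exact effect of swapping each unit: one metric evaluation per unit."""
    base = metric(h_clean)
    effects = np.empty_like(h_clean)
    for i in range(h_clean.size):
        h = h_clean.copy()
        h[i] = h_corr[i]          # patch one component at a time
        effects[i] = metric(h) - base
    return effects

def attribution_patching(h_clean, h_corr):
    """First-order estimate: activation difference times local gradient."""
    return (h_corr - h_clean) * metric_grad(h_clean)

x_clean = rng.normal(size=4)
x_corr = x_clean + 0.5 * rng.normal(size=4)
h_clean = hidden(x_clean)      # forward pass 1
h_corr = hidden(x_corr)        # forward pass 2

exact = activation_patching(h_clean, h_corr)
approx = attribution_patching(h_clean, h_corr)
pearson = float(np.corrcoef(exact, approx)[0, 1])
print(f"Pearson correlation between estimates: {pearson:.3f}")
```

In this toy setting the approximation is good because the metric is shallow and smooth. The abstract's point is that the same linearization degrades in deep, highly non-linear networks, which is where RelP swaps the gradient for LRP-derived coefficients.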

Problem

Research questions and friction points this paper is trying to address.

Improving faithfulness of gradient-based attribution patching in deep networks

Reducing computational cost of activation patching for circuit discovery

Enhancing reliability of mechanistic interpretability methods in nonlinear models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces gradients with LRP propagation coefficients

Maintains computational efficiency with two forward passes and one backward pass, as in attribution patching

Improves faithfulness in deep non-linear networks
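To illustrate what replacing a gradient with a propagation coefficient can look like, here is a minimal sketch on a single LayerNorm-style operation. The summary above does not state which propagation rules RelP applies to each layer type, so the rule used here (treating the normalization scale as a constant during the backward step, as some LRP formulations for transformers do) is an assumption, and all function names and values are hypothetical. Both variants share the same form, activation difference times a per-component coefficient; only the coefficient changes.

```python
import numpy as np

def layer_norm(a, eps=1e-6):
    """Center and rescale a vector (no learned gain or bias)."""
    return (a - a.mean()) / np.sqrt(a.var() + eps)

def metric(a, w):
    """Scalar metric: a linear readout of the normalized activations."""
    return float(w @ layer_norm(a))

def gradient_coeff(a, w, eps=1e-6):
    """Exact d metric / d a, differentiating through the scale as well."""
    scale = np.sqrt(a.var() + eps)
    y = (a - a.mean()) / scale
    return (w - w.mean() - y * np.mean(w * y)) / scale

def frozen_scale_coeff(a, w, eps=1e-6):
    """Assumed LRP-style rule: the scale is treated as a constant."""
    scale = np.sqrt(a.var() + eps)
    return (w - w.mean()) / scale

# Hypothetical readout weights and clean/corrupted activations.
w = np.array([0.8, -0.5, 1.5, 0.2, -1.0, 0.6])
a_clean = np.array([0.5, -1.2, 2.0, 0.3, -0.7, 1.1])
a_corr = np.array([0.1, -0.9, 1.4, 0.8, -0.2, 0.5])

delta = a_corr - a_clean
grad_estimate = delta * gradient_coeff(a_clean, w)      # attribution-patching form
rule_estimate = delta * frozen_scale_coeff(a_clean, w)  # coefficient-swap form

# With the scale frozen, coefficient times input sums to the metric itself
# (a conservation property); gradient times input is almost exactly zero,
# because the normalization makes the metric insensitive to rescaling.
conserved = float(frozen_scale_coeff(a_clean, w) @ a_clean)
grad_times_input = float(gradient_coeff(a_clean, w) @ a_clean)
print(conserved, metric(a_clean, w), grad_times_input)
```

The sketch makes no claim about which estimate tracks activation patching more closely. It only shows that the swap leaves the cost unchanged (the same forward passes and one backward-style computation) while changing how contributions are distributed across components.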

🔎 Similar Papers

What is the Relationship between Tensor Factorizations and Circuits (and How Can We Exploit it)?