Tracking Equivalent Mechanistic Interpretations Across Neural Networks

📅 2026-03-31
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current mechanistic interpretability lacks a precise definition of a valid interpretation and relies heavily on manual, ad hoc construction, which hinders scalability. This work formalizes, for the first time, the problem of *interpretive equivalence*: determining whether distinct neural networks implement the same underlying algorithmic mechanism, without requiring an explicit description of that mechanism. Building on theoretical connections among interpretations, circuits, and internal representations, the study proposes an algorithm for estimating equivalence from representation similarity and demonstrates it in a case study on Transformer-based models. It further establishes necessary and sufficient conditions for interpretive equivalence, thereby offering a rigorous evaluation framework and a pathway toward automated interpretation discovery in mechanistic interpretability.
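
The summary names representation similarity as the basis of the equivalence test, but this listing does not reproduce the paper's exact measure. As an illustrative stand-in only, the sketch below uses linear centered kernel alignment (CKA), a common representation-similarity metric; all function names, shapes, and the threshold intuition in the comments are assumptions, not the paper's method.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment (CKA) between two
    representation matrices X (n x d1) and Y (n x d2), where
    row i of each matrix is a model's activation on input i."""
    # Center the features of each representation.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical usage: compare hidden states of two models on the
# same batch of inputs; a score near 1 indicates highly similar
# representations. Any decision threshold here is illustrative,
# not a condition taken from the paper.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 64))           # model A activations
acts_b = acts_a @ rng.normal(size=(64, 64))   # linearly related model B
print(f"CKA: {linear_cka(acts_a, acts_b):.3f}")
```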
📝 Abstract
Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and a model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and present a case study of its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods for MI and automated, generalizable interpretation-discovery methods.
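
The abstract's core principle can be stated set-theoretically. The notation below is assumed for illustration; the paper's formal definitions are not reproduced in this listing.

```latex
% Illustrative formalization (notation assumed, not from the paper):
% let \mathcal{M}(\pi) be the set of models implementing
% interpretation \pi, and let \sim denote equivalence of models
% on the task. Two interpretations are then equivalent when all
% of their possible implementations are pairwise equivalent:
\[
  \pi_1 \equiv \pi_2
  \iff
  \forall\, M_1 \in \mathcal{M}(\pi_1),\;
  \forall\, M_2 \in \mathcal{M}(\pi_2):\;
  M_1 \sim M_2 .
\]
```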
Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability · interpretive equivalence · neural networks · algorithmic interpretation · representation similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretive equivalence · mechanistic interpretability · representation similarity · algorithmic interpretation · Transformer models