🤖 AI Summary
This paper investigates the theoretical limits of causal abstraction in mechanistic interpretability, specifically whether the framework retains explanatory power when the linearity constraint on alignment maps is relaxed. Method: through rigorous theoretical analysis and empirical validation (measuring interchange-intervention accuracy on randomly initialized language models), the authors demonstrate that arbitrary neural networks can “simulate” any algorithm via highly nonlinear alignment maps. Contribution/Results: they establish that such unconstrained causal abstraction becomes trivial, losing its discriminative and informative power and giving rise to the “nonlinear representation dilemma.” Empirically, interchange-intervention accuracy reaches 100% even when the model has not learned the target task. These results show that causal abstraction without prior assumptions (e.g., structural inductive biases such as linearity) is inherently vacuous: its explanatory utility depends critically on the constraints imposed on alignment maps. This work delineates a fundamental boundary for interpretability theory and underscores the necessity of explicit modeling assumptions about how models encode information.
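To make the metric concrete, here is a minimal sketch (not from the paper) of how interchange-intervention accuracy is typically computed with a linear, DAS-style alignment map. The toy addition task, dimensions, weights, and names are all illustrative assumptions; nothing is trained or fitted here, so the reported accuracy will be near chance. The point is only the mechanics of the metric.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy task: output = a + b for a, b in {0, ..., 4}.
# The high-level algorithm's intermediate causal variable is S = a + b.
VALUES = range(5)

W1 = rng.normal(size=(10, 16))    # untrained toy model weights (illustrative)
W_out = rng.normal(size=(16, 9))  # logits over the 9 possible sums

def encode(a, b):
    x = np.zeros(10)
    x[a] = 1.0
    x[5 + b] = 1.0
    return x

def hidden(a, b):
    return np.tanh(encode(a, b) @ W1)

def predict(h):
    return int(np.argmax(h @ W_out))

# Linear alignment map (DAS-style): an orthogonal rotation R whose first k
# coordinates are taken to encode S. Here R is random; in practice it is learned.
k = 4
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))

def interchange(h_base, h_src):
    z_base, z_src = h_base @ R, h_src @ R
    z_int = z_base.copy()
    z_int[:k] = z_src[:k]            # patch the aligned subspace with the source's value
    return z_int @ R.T

inputs = [(a, b) for a in VALUES for b in VALUES]
hits, total = 0, 0
for (a_b, b_b) in inputs:
    for (a_s, b_s) in inputs:
        expected = a_s + b_s         # algorithm's output after setting S to the source's value
        h_int = interchange(hidden(a_b, b_b), hidden(a_s, b_s))
        hits += (predict(h_int) == expected)
        total += 1

print(f"interchange-intervention accuracy: {hits / total:.0%}")
```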
📝 Abstract
The concept of causal abstraction has recently been popularised as a way to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function that allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that, under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between such information-encoding assumptions and causal abstraction should lead to exciting future work.
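The triviality claim can be illustrated with a small sketch. Assuming a hypothetical toy task and a randomly initialised model whose activations are injective on the finite input set, an unconstrained (here, lookup-table) alignment map can reroute every interchange intervention through a counterfactual input, so interchange-intervention accuracy reaches 100% even though the model has not learned the task. The task, names, and construction below are illustrative, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: output = a + b for a, b in {0, ..., 4}.
# High-level algorithm: intermediate variable S = a + b; output = S.
VALUES = range(5)

# Randomly initialised two-layer "model"; it has NOT learned the task.
W1 = rng.normal(size=(10, 16))
W2 = rng.normal(size=(16, 16))

def encode(a, b):
    x = np.zeros(10)
    x[a] = 1.0
    x[5 + b] = 1.0
    return x

def layer1(a, b):
    return np.tanh(encode(a, b) @ W1)   # hidden state "aligned" with S

def layer2(h1):
    return np.tanh(h1 @ W2)             # final state, read out via tau_out

# Since the random model is injective on this finite input set, an arbitrarily
# nonlinear alignment map can be built as a lookup table over its activations.
inputs = [(a, b) for a in VALUES for b in VALUES]
decode_h2 = {layer2(layer1(a, b)).tobytes(): (a, b) for (a, b) in inputs}

def tau_out(h2):
    # Nonlinear output alignment map: recover the input, recompute the task.
    a, b = decode_h2[h2.tobytes()]
    return a + b

def reroute(s_src):
    # Unconstrained "intervention" S := s_src: jump to the hidden state of any
    # counterfactual input whose sum equals s_src (possible because the map is arbitrary).
    a_cf = min(s_src, 4)
    b_cf = s_src - a_cf
    return layer1(a_cf, b_cf)

hits, total = 0, 0
for (a_b, b_b) in inputs:           # base run (never actually consulted below,
    for (a_s, b_s) in inputs:       # which is exactly why this abstraction is uninformative)
        s_src = a_s + b_s           # value of S taken from the source run
        h1_int = reroute(s_src)     # low-level counterpart of the intervention S := s_src
        hits += (tau_out(layer2(h1_int)) == s_src)  # intervened algorithm also outputs s_src
        total += 1

print(f"IIA with a random model and an unconstrained map: {hits / total:.0%}")  # 100%
```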