Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

πŸ“… 2025-01-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Neural network interpretability lacks a formal, mechanism-grounded parameter decomposition that is both input-adaptive and faithful to the network's underlying computational structure. Method: We propose Attribution-based Parameter Decomposition (APD), the first method to perform mechanism-driven decomposition directly in parameter space. APD combines attribution-guided differentiable masking, mechanism-sparsity regularization, and description-length minimization to decompose parameters into concise, input-adaptive, and faithful mechanistic components. Contribution/Results: APD offers a formal conceptual foundation for "features" in deep networks, enabling the identification of cross-layer distributed representations and hyper-localized mechanisms. In controlled experiments, APD recovers features from superposition, disentangles compressed computations, and localizes cross-layer distributed mechanisms. It establishes a paradigm for minimal-circuit discovery and architecture-agnostic parameter decomposition, advancing mechanistic interpretability beyond post-hoc attribution.

πŸ“ Abstract
Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose neural network parameters into mechanistic components. We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a neural network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. Our approach thus optimizes for a minimal length description of the network's mechanisms. We demonstrate APD's effectiveness by successfully identifying ground truth mechanisms in multiple toy experimental settings: Recovering features from superposition; separating compressed computations; and identifying cross-layer distributed representations. While challenges remain to scaling APD to non-toy models, our results suggest solutions to several open problems in mechanistic interpretability, including identifying minimal circuits in superposition, offering a conceptual foundation for 'features', and providing an architecture-agnostic framework for neural network decomposition.
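The abstract's three criteria (faithfulness, per-input minimality, simplicity) can be read as three loss terms. The following is a minimal toy sketch of that objective, not the paper's implementation: the component array `P`, the attribution proxy, the top-k masking, and the loss weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a flattened parameter vector `theta` and C candidate
# parameter components P[c] (names are illustrative, not from the paper's code).
theta = rng.normal(size=8)
C = 4
P = rng.normal(size=(C, 8)) * 0.5

def faithfulness_loss(theta, P):
    # (i) Faithfulness: the components should sum to the original parameters.
    return np.sum((theta - P.sum(axis=0)) ** 2)

def minimality_loss(attributions, P, k=2):
    # (ii) Minimality: keep only the top-k attributed components for this
    # input and push the rest toward zero (a crude stand-in for
    # attribution-based masking of inactive components).
    inactive = np.argsort(attributions)[:-k]  # indices of least-attributed
    return np.sum(P[inactive] ** 2)

def simplicity_loss(P, p=0.9):
    # (iii) Simplicity: a Schatten-like p-norm penalty (p < 1) encouraging
    # each component to be sparse/low-complexity.
    return np.sum(np.abs(P) ** p)

# Toy proxy for per-input attributions; APD derives these from gradients.
attributions = np.abs(P @ theta)

total = (faithfulness_loss(theta, P)
         + minimality_loss(attributions, P)
         + 0.1 * simplicity_loss(P))  # 0.1 is an arbitrary illustrative weight
```

In the paper's setting the components are full parameter tensors per layer and the attributions come from gradient-based approximations; this sketch only shows how the three terms trade off in a single scalar objective.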
Problem

Research questions and friction points this paper is trying to address.

Neural Network
Parameter Decomposition
Interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribution-based Parameter Decomposition
Neural Network Interpretability
Efficient Information Processing
πŸ”Ž Similar Papers
No similar papers found.