Disentangling MLP Neuron Weights in Vocabulary Space

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the interpretability challenge of MLP neurons in large language models, whose semantics remain opaque due to the absence of clear lexical correspondences in their weights. The authors propose ROTATE, a novel method that, for the first time, enables purely weight-space neuron disentanglement without requiring data or forward passes. ROTATE optimizes a rotation of the weight matrix in vocabulary space to maximize kurtosis, revealing a strong association between high vocabulary-space kurtosis and individual semantic concepts. This yields composable, interpretable neuron units. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that the recovered vocabulary channels faithfully reflect actual neuron behavior, with human evaluations showing that channel-level descriptions outperform activation-based baselines by a factor of 2–3.
📝 Abstract
Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method requiring no forward passes that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model's vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron's behavior: ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
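The core idea in the abstract — rotate neuron weights so their vocabulary projections become maximally kurtotic (i.e., sparse and peaked) — can be sketched in a few lines. The sketch below is a hypothetical minimal implementation, not the paper's actual algorithm: the function names (`excess_kurtosis`, `rotate_neurons`), the skew-symmetric matrix-exponential parametrization of the rotation, and the use of plain gradient ascent with Adam are all assumptions; the paper's objective and optimizer may differ.

```python
import torch

def excess_kurtosis(x, dim=-1, eps=1e-8):
    # Excess kurtosis of each row: E[((x - mu)/sigma)^4] - 3.
    # Heavy-tailed (sparse, peaked) rows score high; Gaussian rows score ~0.
    mu = x.mean(dim=dim, keepdim=True)
    var = x.var(dim=dim, unbiased=False, keepdim=True)
    z = (x - mu) / (var.sqrt() + eps)
    return (z ** 4).mean(dim=dim) - 3.0

def rotate_neurons(W, U, steps=200, lr=0.05):
    """Find a rotation R mixing the n neuron weight vectors W (n x d) so that
    the rows of (R @ W) @ U -- their projections through the unembedding
    U (d x vocab) -- have maximal mean excess kurtosis. Hypothetical sketch."""
    n = W.shape[0]
    # Parametrize R = exp(A - A^T): the matrix exponential of a
    # skew-symmetric matrix is always an orthogonal rotation matrix.
    A = torch.zeros(n, n, requires_grad=True)
    opt = torch.optim.Adam([A], lr=lr)
    for _ in range(steps):
        R = torch.matrix_exp(A - A.T)
        proj = (R @ W) @ U              # rows live in vocabulary space
        loss = -excess_kurtosis(proj).mean()  # ascend mean kurtosis
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.matrix_exp(A - A.T)
```

Under this parametrization the search stays on the rotation group by construction, so no projection step is needed; each row of `R @ W` is a candidate "vocabulary channel" whose top-scoring tokens can then be read off directly from its projection through `U`.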
Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability
MLP neuron weights
vocabulary space
disentanglement
weight space
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROTATE
weight-space disentanglement
vocabulary channels
kurtosis optimization
data-free interpretability
Asaf Avrahamy
Blavatnik School of Computer Science and AI, Tel Aviv University
Yoav Gur-Arieh
Blavatnik School of Computer Science and AI, Tel Aviv University
Mor Geva
Tel Aviv University, Google Research
Natural Language Processing