Partially Rewriting a Transformer in Natural Language

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates rewriting critical transformer modules, a feed-forward network (FFN) layer and the residual stream, in a large language model (LLM) using natural-language explanations, while attempting to preserve the model's behavior. Method: it first approximates the FFN with a transcoder (a wider MLP with sparsely activating neurons) and uses an automated interpretability pipeline to generate explanations for those neurons; it then replaces the transcoder's first layer with an LLM-based simulator that predicts each neuron's activation from its explanation and the surrounding context; finally, it measures how much these modifications increase the model's loss. Results: the rewritten module's loss increase is statistically similar to replacing the module's output with the zero vector, and repeating the protocol on the residual stream with a sparse autoencoder yields similar results, indicating that current natural-language explanations do not outperform the zero-ablation baseline. The work contributes a module-level natural-language rewriting framework with a quantitative, empirically grounded distortion evaluation.
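The transcoder step above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's code: all sizes, weights, and the top-k sparsification rule are assumptions standing in for a trained transcoder whose sparse hidden neurons would then receive natural-language explanations.

```python
# Toy sketch of a "transcoder": a wider MLP whose sparse hidden layer is
# trained to reproduce an FFN's output. Weights here are random stand-ins.
import math
import random

random.seed(0)

D, H = 4, 16  # model dimension and (wider) transcoder hidden dimension; toy sizes


def relu(x):
    return [max(0.0, v) for v in x]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Hypothetical encoder/decoder weights in place of trained parameters.
W_enc = [[random.gauss(0, 1 / math.sqrt(D)) for _ in range(D)] for _ in range(H)]
W_dec = [[random.gauss(0, 1 / math.sqrt(H)) for _ in range(H)] for _ in range(D)]


def transcoder(x, k=3):
    # Sparse hidden layer: keep only the top-k activations. These few active
    # "neurons" are what an automated interpretability pipeline would try to
    # explain in natural language.
    acts = relu(matvec(W_enc, x))
    threshold = sorted(acts, reverse=True)[k - 1]
    sparse = [a if a >= threshold and a > 0 else 0.0 for a in acts]
    return sparse, matvec(W_dec, sparse)


acts, out = transcoder([0.5, -1.0, 0.3, 0.8])
print("active neurons:", sum(1 for a in acts if a > 0))
```

In the paper's pipeline, the first layer of such a transcoder (the `W_enc` pass here) is the part replaced by an LLM-based simulator that predicts each neuron's activation from its explanation and the context.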

📝 Abstract
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to partially rewrite a large language model using simple natural language explanations. We first approximate one of the feedforward networks in the LLM with a wider MLP with sparsely activating neurons - a transcoder - and use an automated interpretability pipeline to generate explanations for these neurons. We then replace the first layer of this sparse MLP with an LLM-based simulator, which predicts the activation of each neuron given its explanation and the surrounding context. Finally, we measure the degree to which these modifications distort the model's final output. With our pipeline, the model's increase in loss is statistically similar to entirely replacing the sparse MLP output with the zero vector. We employ the same protocol, this time using a sparse autoencoder, on the residual stream of the same layer and obtain similar results. These results suggest that more detailed explanations are needed to improve performance substantially above the zero ablation baseline.
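The evaluation described in the abstract can be sketched as a patching comparison. The snippet below is a minimal illustration with made-up numbers and a squared-error stand-in for the LM's next-token loss; the function names and vectors are hypothetical, not the paper's implementation.

```python
# Toy sketch of the evaluation protocol: compare the loss increase from
# patching a module's output with a simulator-based reconstruction against
# zero-ablating the module outright.
def patched_loss(patched_output, target):
    # Stand-in "downstream loss": squared error against a target vector.
    # The paper measures the LLM's actual language-modeling loss instead.
    return sum((p - t) ** 2 for p, t in zip(patched_output, target))


clean = [0.9, -0.2, 0.4]           # hypothetical clean module output
target = [1.0, 0.0, 0.5]
reconstruction = [0.5, -0.1, 0.2]  # hypothetical simulator-based rewrite
zero = [0.0, 0.0, 0.0]             # zero-ablation baseline

base = patched_loss(clean, target)
delta_rewrite = patched_loss(reconstruction, target) - base
delta_zero = patched_loss(zero, target) - base

# The paper's finding is that, in the real model, the rewrite's loss increase
# is statistically similar to the zero-ablation one; the gap shown by these
# arbitrary toy numbers carries no such meaning.
print(delta_rewrite, delta_zero)
```

A rewrite that captured the module's behavior well would drive `delta_rewrite` far below `delta_zero`; the abstract's point is that current explanation granularity does not achieve this separation.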
Problem

Research questions and pain points this paper addresses.

Interpretability
Deep Neural Networks
Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable AI
Sparse Autoencoders
Transformer Models