Interactions Between Crosscoder Features: A Compact Proofs Perspective

📅 2026-06-07

🤖 AI Summary

This work addresses a limitation of dictionary learning methods such as Sparse Autoencoders (SAEs) and crosscoders, which decompose a model's activations into independent features, so that interactions between features appear as reconstruction error. Using compact proofs, the authors show how, in principle, a crosscoder can be used to construct a compact proof of model performance, and they interpret an error term in that proof as a measure of interaction between crosscoder features, with an explicit expression for MLP layers. The measure is applied in three ways: as a differentiable loss penalty, as a basis for feature clustering, and in an analysis of sleeper agents. With the penalty, crosscoders retain 60% of MLP performance when only a single feature is kept at each datapoint and neuron, compared to 10% for standard crosscoders. Clustering by the interaction measure yields semantically meaningful feature clusters, and sleeper agents are found to have significant interactions.

📝 Abstract

Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into independent features. Interactions between features hence induce errors in the reconstruction. We formalize this intuition via compact proofs and make five contributions. First, we show how, in principle, a compact proof of model performance can be constructed using a crosscoder. Second, we show that an error term arising in this proof can naturally be interpreted as a measure of interaction between crosscoder features and provide an explicit expression for the interaction term in the Multi-Layer Perceptron (MLP) layers. We then provide three applications of this new interaction measure. In our third contribution we show that the interaction term itself can be used as a differentiable loss penalty. Applying this penalty, we can achieve "computationally sparse" crosscoders that retain 60% of MLP performance when only keeping a single feature at each datapoint and neuron, compared to 10% in standard crosscoders. We then show that clustering according to our interaction measure provides semantically meaningful feature clusters, and finally that sleeper agents have significant interactions. Code is available at https://github.com/chainik1125/crosscoders-feature-interactions/tree/arxiv.
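
The intuition that feature interactions induce error can be illustrated with a toy sketch. It is not the interaction measure defined in the paper, whose explicit expression is not reproduced here. It only assumes a single ReLU neuron whose pre-activation is a sum of per-feature contributions, and the function name `neuron_interaction_gap` and the input values are hypothetical. The gap between the neuron's true output and the sum of outputs computed one feature at a time is zero when the features do not interact through the nonlinearity and nonzero when they do.

```python
# Toy illustration only: NOT the paper's interaction measure.
# Assumption: one ReLU neuron, pre-activation = sum of per-feature contributions.

def relu(x: float) -> float:
    return max(0.0, x)

def neuron_interaction_gap(contributions: list[float]) -> float:
    """True neuron output minus the sum of outputs computed per feature.

    contributions[i] is the (hypothetical) amount feature i adds to the
    neuron's pre-activation on one datapoint.
    """
    joint = relu(sum(contributions))                  # features processed together
    independent = sum(relu(c) for c in contributions)  # features treated as independent
    return joint - independent

# Two features pushing the neuron in the same direction: no interaction.
aligned_gap = neuron_interaction_gap([2.0, 1.5])

# One feature suppresses the other through the ReLU: nonzero interaction.
opposed_gap = neuron_interaction_gap([2.0, -1.5])

print(aligned_gap, opposed_gap)
```

In this toy setting, treating the features as independent reproduces the neuron exactly in the first case and overestimates its output in the second, which is the kind of error an independent-feature decomposition cannot account for.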

Problem

Research questions and friction points this paper is trying to address.

crosscoder

feature interaction

dictionary learning

model interpretability

reconstruction error

Innovation

Methods, ideas, or system contributions that make the work stand out.

feature interactions

crosscoder

compact proofs