Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional sparse autoencoders, which assign each latent feature a single decoder direction and therefore cannot capture the intrinsic multidimensional structure of language model representations, often resulting in feature splitting and loss of geometric information. To overcome this, the authors propose the Subspace-Aware Sparse Autoencoder (SASA), which generalizes each decoder from a single vector to a learnable low-dimensional subspace. SASA enforces block sparsity via Top-s group gating and uses nuclear-norm regularization to adapt the effective rank of each group. Theoretically, when the block size is at least the intrinsic feature dimension, a single group representing the entire feature slice is the global minimizer of the SASA objective, which reduces sample complexity from exponential to polynomial in the feature dimension. Experiments on GPT-2 and Mistral-7B show that SASA mitigates feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while using roughly half the training tokens.
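
The three ingredients named above (decoder subspaces, Top-s group gating, and a nuclear-norm penalty) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the linear encoder, the use of the group code norm as the gating score, and the weight `lam` are all assumptions made for illustration; the summary does not specify these details.

```python
import numpy as np

def sasa_forward(x, W_enc, b_enc, W_dec, b_dec, s):
    """Hypothetical SASA-style forward pass for one activation vector.

    Shapes: x (d,), W_enc (G, r, d), b_enc (G, r), W_dec (G, r, d), b_dec (d,).
    Each of the G groups owns an r-dimensional decoder subspace.
    """
    # Encode into G groups of r coefficients each (linear encoder: an assumption).
    z = np.einsum("grd,d->gr", W_enc, x - b_dec) + b_enc
    # Score each group by the norm of its code (gating criterion: an assumption).
    scores = np.linalg.norm(z, axis=1)
    # Top-s group gating: keep the s highest-scoring groups, zero the rest.
    keep = np.argsort(scores)[-s:]
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[keep] = True
    z_gated = np.where(mask[:, None], z, 0.0)
    # Decode: each active group contributes a vector from its own subspace.
    x_hat = np.einsum("gr,grd->d", z_gated, W_dec) + b_dec
    return x_hat, z_gated, mask

def nuclear_norm_penalty(W_dec):
    """Sum of nuclear norms (sums of singular values) of the per-group decoder blocks."""
    return sum(np.linalg.norm(W_g, ord="nuc") for W_g in W_dec)

# Toy usage with random parameters (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
d, G, r, s, lam = 16, 8, 4, 2, 1e-3
W_enc = rng.normal(size=(G, r, d)) / np.sqrt(d)
W_dec = rng.normal(size=(G, r, d)) / np.sqrt(d)
b_enc, b_dec = np.zeros((G, r)), np.zeros(d)
x = rng.normal(size=d)

x_hat, z_gated, mask = sasa_forward(x, W_enc, b_enc, W_dec, b_dec, s)
loss = np.sum((x - x_hat) ** 2) + lam * nuclear_norm_penalty(W_dec)
print("active groups:", np.flatnonzero(mask), "loss:", float(loss))
```

Penalizing the nuclear norm of each decoder block pushes its small singular values toward zero, which is one way a group with nominal size r can settle on a lower effective rank.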

📝 Abstract

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming that features are one-dimensional. We show that this assumption is mismatched with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders requires a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a solution with strictly lower risk under the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
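
The geometric claim can be made concrete in the smallest case, $d_i = 2$. The toy below is an illustration under stated assumptions, not the paper's construction or proof. The feature is taken to be the unit circle inside a 2-D plane of a 5-D space, each point is reconstructed by one atom with a non-negative coefficient, and the atoms are evenly spaced. All function names are hypothetical.

```python
import numpy as np

def _circle(n):
    """n evenly spaced unit vectors in the plane, shape (n, 2)."""
    t = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(t), np.sin(t)], axis=1)

def worst_single_direction_error(k, n_test=10_000):
    """Worst-case error when each unit-circle point is rebuilt from ONE of k atoms.

    For a unit point x and a unit atom a, the best non-negative coefficient is
    c = max(<x, a>, 0), and the residual norm is sqrt(1 - c**2).
    """
    atoms, points = _circle(k), _circle(n_test)
    coeffs = np.maximum(points @ atoms.T, 0.0)            # (n_test, k)
    errors = np.sqrt(np.maximum(1.0 - coeffs ** 2, 0.0))  # residual per atom
    return float(errors.min(axis=1).max())                # best atom, worst point

def atoms_needed(eps):
    """Smallest k with sin(pi / k) <= eps, the worst-case error for k >= 3 even atoms."""
    return int(np.ceil(np.pi / np.arcsin(eps)))

def subspace_block_error(n_test=10_000, d_model=5, seed=0):
    """Worst-case error when ONE rank-2 block spans the feature's plane."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))    # orthonormal basis, (d_model, 2)
    points_hd = _circle(n_test) @ Q.T                     # circle embedded in d_model dims
    recon = points_hd @ Q @ Q.T                           # project onto the block's subspace
    return float(np.linalg.norm(points_hd - recon, axis=1).max())

for k in (4, 8, 16, 32):
    print(k, "atoms -> worst error", worst_single_direction_error(k))
print("one rank-2 block -> worst error", subspace_block_error())
```

In this 2-D toy the atom count grows only like $\pi/\varepsilon$. Standard covering-number arguments suggest that covering a unit sphere in $d_i$ dimensions needs on the order of $(1/\varepsilon)^{d_i - 1}$ directions, which is consistent with the exponential dependence the abstract states; the paper's exact bound is not reproduced here.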

Problem

Research questions and friction points this paper is trying to address.

Sparse Autoencoders

feature splitting

mechanistic interpretability

subspace representation

multi-dimensional features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Subspace-Aware Sparse Autoencoders

feature splitting

block sparsity