🤖 AI Summary
Sparse autoencoders (SAEs) exhibit pathological "latent redundancy", in which the number of effectively activated features is far lower than the number of latents, particularly when neural activations lie on high-dimensional feature manifolds.
Method: We introduce a capacity allocation modeling framework that formalizes SAEs’ linear decomposition and sparse representation as a dynamic allocation of representational capacity over the underlying manifold geometry.
Contribution/Results: The model reproduces multi-stage scaling regimes and shows how manifold geometric properties, such as curvature and dimensional coupling, can suppress feature activation density and thereby induce latent redundancy. Preliminary analysis of activation data from large language models examines whether SAEs trained in practice fall into this pathological regime. The framework offers theoretical grounding and diagnostic tools for interpretable SAE modeling and architecture design, connecting manifold geometry with sparse coding behavior in neural representations.
📝 Abstract
Sparse autoencoders (SAEs) model the activations of a neural network as linear combinations of sparsely occurring directions of variation (latents). The ability of SAEs to reconstruct activations follows scaling laws with respect to the number of latents. In this work, we adapt a capacity-allocation model from the neural scaling literature (Brill, 2024) to understand SAE scaling, and in particular, to understand how "feature manifolds" (multi-dimensional features) influence scaling behavior. Consistent with prior work, the model recovers distinct scaling regimes. Notably, in one regime, feature manifolds have the pathological effect of causing SAEs to learn far fewer features in the data than there are latents in the SAE. We provide a preliminary discussion of whether SAEs are in this pathological regime in the wild.
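To make the setup concrete, the decomposition described above can be sketched as a single SAE forward pass: activations are encoded into a sparse, non-negative latent code, and reconstructed as a linear combination of decoder directions. This is a minimal illustrative sketch in NumPy; the dimensions, initialization, and ReLU encoder are assumptions for illustration, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_latents = 16, 64  # hypothetical sizes for illustration
W_enc = rng.standard_normal((d_model, n_latents)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)
W_dec = rng.standard_normal((n_latents, d_model)) / np.sqrt(n_latents)

def sae_forward(x):
    # Encoder: ReLU yields a sparse, non-negative latent code z.
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decoder: reconstruction is a linear combination of the rows of
    # W_dec (the "directions of variation" the SAE has learned).
    x_hat = z @ W_dec
    return z, x_hat

x = rng.standard_normal(d_model)   # a stand-in for a network activation
z, x_hat = sae_forward(x)
print(z.shape, x_hat.shape)        # latent code and reconstruction
```

Scaling studies like this one vary `n_latents` and track how reconstruction error falls; the paper's question is how many of those latents correspond to distinct features in the data.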