🤖 AI Summary
To address the high computational cost of training one sparse autoencoder (SAE) per layer in large language models (LLMs), this paper proposes training SAEs on groups of layers. Rather than fitting a separate SAE to every layer, the method groups contiguous transformer layers by inter-layer activation similarity and trains a single SAE for each group. On Pythia-160M, this yields a training speedup of up to 6× without compromising reconstruction fidelity, feature interpretability, or downstream task performance. The key contribution is shifting the granularity of SAE training from individual layers to layer groups, which offers a more scalable way to train SAEs for interpretability research on large LLMs.
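The grouping step can be illustrated with a small sketch. It assumes one representative activation vector per layer, uses angular distance as the similarity measure, and cuts the layer stack at the largest adjacent-layer distances. The function names and the greedy cut rule are illustrative assumptions, not necessarily the procedure used in the paper.

```python
import math


def angular_distance(u, v):
    """Angular distance in [0, 1] between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def group_contiguous_layers(layer_acts, num_groups):
    """Split layers 0..L-1 into `num_groups` contiguous groups.

    Hypothetical rule: place group boundaries at the (num_groups - 1)
    largest distances between adjacent layers.
    """
    n = len(layer_acts)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and the number of layers")
    # Distance between each layer and the next one.
    gaps = [angular_distance(layer_acts[i], layer_acts[i + 1]) for i in range(n - 1)]
    # Indices of the largest gaps, restored to layer order.
    largest = sorted(range(n - 1), key=lambda i: gaps[i], reverse=True)
    cuts = sorted(largest[: num_groups - 1])
    groups, start = [], 0
    for c in cuts:
        groups.append(list(range(start, c + 1)))
        start = c + 1
    groups.append(list(range(start, n)))
    return groups
```

With `num_groups` equal to the number of layers, this reduces to the usual one-SAE-per-layer setup; smaller values trade per-layer specificity for fewer SAEs to train.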
📝 Abstract
Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of Large Language Models (LLMs). They reconstruct the model's activations as a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one per group of contiguous layers. Our experimental results on Pythia 160M show a speedup of up to 6x without compromising reconstruction quality or performance on downstream tasks. Layer clustering therefore offers an efficient approach to training SAEs in modern LLMs.
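As a rough illustration of what "a sparse linear combination of interpretable features" means, the sketch below runs a single forward pass of a common ReLU SAE formulation in plain Python. The parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`) and the exact architecture are assumptions for illustration; the abstract does not specify them.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector `x`.

    Assumed shapes: W_enc and W_dec each hold one row per feature,
    and every row has the same length as `x`.
    """
    # Encoder: f = ReLU(W_enc @ (x - b_dec) + b_enc)
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    features = [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: x_hat = b_dec + sum_j f_j * W_dec[j]
    # Only active (non-zero) features contribute to the reconstruction.
    x_hat = list(b_dec)
    for f_j, direction in zip(features, W_dec):
        if f_j != 0.0:
            for k, d in enumerate(direction):
                x_hat[k] += f_j * d
    return features, x_hat


# Toy example: four features along +/- each axis of a 2-d activation.
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
features, x_hat = sae_forward([1.0, -2.0], W, [0.0] * 4, W, [0.0, 0.0])
```

In this toy case only two of the four features are active, and their weighted decoder rows sum back to the input. Under the grouping strategy described above, one such SAE would be trained on activations from every layer in a group rather than from a single layer.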