Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 6
Influential: 0

🤖 AI Summary
To address the high computational cost of training a separate sparse autoencoder (SAE) for every layer of a large language model (LLM), this paper proposes training one SAE per group of contiguous transformer layers. Layers are grouped by the similarity of their activations, and a single SAE is trained for each group instead of one per layer. On Pythia-160M, the approach achieves up to a 6× training speedup without compromising reconstruction quality or downstream task performance. The key contribution is shifting the granularity of SAE training from individual layers to layer groups, which offers a more scalable way to train SAEs for interpretability work on LLMs.

📝 Abstract
Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of Large Language Models (LLMs). They reconstruct the model's activations with a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one for a given group of contiguous layers. Our experimental results on Pythia 160M highlight a speedup of up to 6x without compromising the reconstruction quality and performance on downstream tasks. Therefore, layer clustering presents an efficient approach to train SAEs in modern LLMs.
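To make the abstract's description concrete, here is a minimal NumPy sketch of an SAE that reconstructs activations from a sparse combination of learned features, applied to activations pooled from one group of layers. The ReLU encoder, the L1 sparsity penalty, the dimensions, and pooling the group's activations by concatenation are all illustrative assumptions; the abstract only states that one SAE is trained per group of contiguous layers.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then decode linearly."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                          # linear reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty (assumed)."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1

rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 32  # hypothetical sizes

# Hypothetical activations from three contiguous layers that form one group.
group_acts = [rng.normal(size=(n_tokens, d_model)) for _ in range(3)]
# Assumption: the group's single SAE sees activations from every layer in it.
x = np.concatenate(group_acts, axis=0)

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

With three layers in the group, this trains one set of SAE weights where the per-layer baseline would train three, which is where the reduction in training cost comes from.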
Problem

Research questions and friction points this paper is trying to address.

Training sparse autoencoders for large language models is computationally intensive
Current approaches require separate SAE training for each model layer
An efficient SAE training method is needed that maintains performance and interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Groups similar layers to share SAEs
Introduces the AMAD metric for choosing the layer grouping
Accelerates training with minimal performance impact
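
The grouping idea in the points above can be sketched as follows. This summary does not define AMAD or the exact grouping procedure, so the sketch assumes an angular-distance measure between the activations of adjacent layers and a simple rule that cuts the layer stack at the largest adjacent distances. Both `angular_distance` and `contiguous_groups` are hypothetical helpers, not the paper's algorithm.

```python
import numpy as np

def angular_distance(a, b):
    """Mean angular distance in [0, 1] between paired activation rows."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))

def contiguous_groups(layer_acts, k):
    """Split layers into k contiguous groups at the k-1 largest adjacent gaps."""
    n = len(layer_acts)
    gaps = [angular_distance(layer_acts[i], layer_acts[i + 1]) for i in range(n - 1)]
    # A cut at index i ends one group at layer i and starts the next at i + 1.
    cuts = sorted(int(i) for i in np.argsort(gaps)[::-1][: k - 1])
    groups, start = [], 0
    for c in cuts:
        groups.append(list(range(start, c + 1)))
        start = c + 1
    groups.append(list(range(start, n)))
    return groups

# Toy activations: layers 0-2 point one way, layers 3-5 the opposite way.
eye = np.eye(4)
layer_acts = [eye + 0.01 * j for j in range(3)] + [-eye + 0.01 * j for j in range(3)]

groups = contiguous_groups(layer_acts, k=2)
print(groups)  # [[0, 1, 2], [3, 4, 5]]
```

Cutting only between adjacent layers keeps every group contiguous, matching the abstract's "group of contiguous layers"; a clustering method with a contiguity constraint would be another way to get the same property.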