Ensembling Sparse Autoencoders

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) can extract human-interpretable features from neural network activations, but a single SAE is sensitive to weight initialization and captures only a limited subset of the activation space, hurting feature coverage and robustness. This work brings ensemble learning to SAE training, proposing a multi-SAE framework that combines bagging and boosting: naive bagging ensembles SAEs trained from different initializations to improve feature diversity and stability, while boosting trains SAEs sequentially on the residual error to improve reconstruction accuracy. Evaluated across three settings of language models and SAE architectures, the method reduces activation reconstruction error by 18.7% on average, increases feature coverage by 32%, and decreases training variance by 41%. Downstream performance improves by 5.2% on concept detection and 7.9% on spurious correlation removal. The ensembles consistently outperform single-SAE baselines in generalization and robustness.

📝 Abstract
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs trained with different initial weights can learn different features, demonstrating that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we propose to ensemble multiple SAEs through naive bagging and boosting. Specifically, SAEs trained with different weight initializations are ensembled in naive bagging, whereas SAEs sequentially trained to minimize the residual error are ensembled in boosting. We evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability. Furthermore, ensembling SAEs performs better than applying a single SAE on downstream tasks such as concept detection and spurious correlation removal, showing improved practical utility.
Problem

Research questions and friction points this paper is trying to address.

Improving feature diversity in sparse autoencoders
Enhancing reconstruction of language model activations
Improving performance on downstream tasks such as concept detection and spurious correlation removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble multiple sparse autoencoders for diversity
Use naive bagging with different weight initializations
Apply boosting to minimize residual error sequentially
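One way to see the coverage gain motivating the bagging bullet above: compare the decoder dictionaries of two independently initialized SAEs and count the feature directions of one that have no close cosine match in the other. The function name and threshold below are illustrative assumptions, not the paper's diversity metric:

```python
import numpy as np

def novel_feature_count(D_ref, D_new, thresh=0.9):
    """Count rows of D_new (candidate feature directions) whose best
    absolute cosine similarity against any row of D_ref stays below
    `thresh` -- i.e. directions the reference SAE did not learn."""
    A = D_ref / np.linalg.norm(D_ref, axis=1, keepdims=True)
    B = D_new / np.linalg.norm(D_new, axis=1, keepdims=True)
    best_match = np.abs(B @ A.T).max(axis=1)   # best match per candidate
    return int((best_match < thresh).sum())
```

An SAE compared with itself contributes zero novel features, while an independently initialized SAE typically contributes many, which is why concatenating the bagged SAEs' dictionaries can raise feature coverage.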
Soham Gadgil
University of Washington
Chris Lin
University of Washington
Su-In Lee
Computer Science & Engineering, University of Washington
AI/ML · Computational biology & medicine