🤖 AI Summary
This work addresses the discovery and semantic alignment of interpretable concepts across multiple pre-trained deep neural networks. To this end, we propose the Universal Sparse Autoencoder (USAE), which jointly learns a single sparse concept space shared across models, architectures, tasks, and datasets. Methodologically, USAE combines joint activation reconstruction with overcomplete dictionary learning, so that hidden-layer features from different models are encoded into, and aligned within, the same concept space. Our approach is presented as the first unified cross-model concept space, addressing the limitation that existing concept-based interpretability methods analyze one model at a time. Experiments show that the learned concepts span a semantic hierarchy, from low-level attributes (e.g., color, texture) to higher-level structures (e.g., object parts), and are human-interpretable. USAE also reconstructs activations with high fidelity across multiple vision models and supports coordinated activation maximization across models, which is consistent with effective semantic alignment.
📝 Abstract
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation (concepts) across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications, such as coordinated activation maximization, that open avenues for deeper insights into multi-model AI systems.
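To make the core idea concrete, the following is a minimal sketch of the forward pass and shared objective described in the abstract: activations from one model are encoded into a sparse code, and that same code is decoded to approximate the activations of every model. It is not the authors' implementation. The per-model linear encoders and decoders, the ReLU plus top-k sparsity rule, the mean-squared-error loss, the toy dimensions, and all names (`ToyUSAE`, `top_k`, `loss_from`) are assumptions chosen for illustration, and random arrays stand in for real model activations.

```python
import numpy as np

def top_k(z, k):
    """Keep the k largest entries in each row of z and zero out the rest."""
    idx = np.argpartition(z, -k, axis=1)[:, -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out

class ToyUSAE:
    """Hypothetical toy model: one shared concept space, one encoder and decoder per model."""

    def __init__(self, model_dims, n_concepts, k, rng):
        self.k = k
        # Assumption: a linear map per model, because activation widths differ.
        self.enc = [rng.normal(0.0, 0.1, (d, n_concepts)) for d in model_dims]
        self.dec = [rng.normal(0.0, 0.1, (n_concepts, d)) for d in model_dims]

    def encode(self, i, acts):
        """Map activations of model i to a sparse, non-negative concept code."""
        return top_k(np.maximum(acts @ self.enc[i], 0.0), self.k)

    def decode(self, j, codes):
        """Map a concept code back to the activation space of model j."""
        return codes @ self.dec[j]

    def loss_from(self, i, batch):
        """Encode from model i, then sum the reconstruction error over all models."""
        codes = self.encode(i, batch[i])
        return sum(np.mean((self.decode(j, codes) - batch[j]) ** 2)
                   for j in range(len(batch)))

rng = np.random.default_rng(0)
dims = [16, 24, 32]                            # toy activation widths for three models
batch = [rng.normal(size=(8, d)) for d in dims]  # stand-in activations for the same 8 inputs
usae = ToyUSAE(dims, n_concepts=64, k=4, rng=rng)

codes = usae.encode(0, batch[0])               # sparse code computed from model 0 only
total = usae.loss_from(0, batch)               # error when reconstructing all three models
```

In a training loop under these assumptions, the source model `i` would be resampled at each step and the weights updated by gradient descent on `loss_from(i, batch)`, so that no single model's activations define the concept space.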