Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

📅 2025-02-06
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the discovery and semantic alignment of interpretable concepts across multiple pretrained deep neural networks. It proposes the Universal Sparse Autoencoder (USAE), which jointly learns a single sparse concept space shared across models, architectures, tasks, and datasets. Methodologically, USAE combines joint activation reconstruction with overcomplete dictionary learning: hidden-layer activations from any one model are encoded into the shared space and decoded to approximate the activations of every other model. This contrasts with existing concept-based interpretability methods, which analyze one model at a time. The learned concepts range from low-level attributes (e.g., color, texture) to higher-level structures (e.g., object parts) and are reported to be semantically coherent. The shared space also enables applications such as coordinated activation maximization across vision models.

📝 Abstract
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation (concepts) across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications, such as coordinated activation maximization, that open avenues for deeper insights in multi-model AI systems.
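The core idea in the abstract (encode activations from one model into a shared sparse code, then decode that code into every model's activation space) can be illustrated with a small NumPy sketch. This is a forward-pass illustration under stated assumptions, not the authors' implementation: the model names, layer widths, dictionary size, random initialization, and the use of TopK sparsity are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: two models with different activation widths,
# sharing one overcomplete concept space.
model_dims = {"model_a": 8, "model_b": 12}
n_concepts = 32  # larger than either activation width (overcomplete)
k = 4            # active concepts per sample (assumed TopK sparsity)

# One encoder and one decoder per model, all tied to the same concept space.
encoders = {m: rng.normal(0, 0.1, (d, n_concepts)) for m, d in model_dims.items()}
decoders = {m: rng.normal(0, 0.1, (n_concepts, d)) for m, d in model_dims.items()}


def encode(source, acts):
    """Map one model's activations to sparse, non-negative concept codes."""
    pre = np.maximum(acts @ encoders[source], 0.0)
    # Indices of the (n_concepts - k) smallest entries per row; zero them out.
    drop = np.argpartition(pre, -k, axis=1)[:, :-k]
    codes = pre.copy()
    np.put_along_axis(codes, drop, 0.0, axis=1)
    return codes


def universal_loss(source, batch):
    """Encode from `source`, decode into every model, sum the mean squared errors."""
    codes = encode(source, batch[source])
    return sum(
        np.mean((codes @ decoders[target] - batch[target]) ** 2)
        for target in model_dims
    )


# Stand-in activations for the same 16 inputs passed through both models.
batch = {m: rng.normal(size=(16, d)) for m, d in model_dims.items()}
codes = encode("model_a", batch["model_a"])
loss = universal_loss("model_a", batch)
```

In a training loop, one would presumably pick a source model per step and minimize `universal_loss` with respect to all encoders and decoders, so that column `j` of the code refers to the same concept regardless of which model produced it. The gradient step is omitted here to keep the sketch short.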
Problem

Research questions and friction points this paper is trying to address.

Interpretable concept alignment across models
Universal concept space for multiple networks
Cross-model analysis using sparse autoencoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Sparse Autoencoders for cross-model alignment
Shared objective optimizes universal concept space
Interprets activations across diverse models and datasets