Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

📅 2025-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the discovery and semantic alignment of interpretable concepts across multiple pretrained deep neural networks. It proposes the Universal Sparse Autoencoder (USAE), which jointly learns a single sparse concept space shared by models that differ in architecture, task, and training dataset. Methodologically, USAE combines joint activation reconstruction with overcomplete dictionary learning, so that hidden-layer features from different models are encoded into the same concepts and aligned semantically. In contrast to single-model interpretability methods, the resulting concept space is shared across models. Experiments on vision models show that the learned concepts span a hierarchy of semantics, from low-level attributes (e.g., color, texture) to higher-level structures (e.g., object parts), and are human-interpretable. USAE also reconstructs the activations of multiple vision models and supports coordinated activation maximization across models, which the authors use as evidence of semantic alignment.

📝 Abstract

We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation (concepts) across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications, such as coordinated activation maximization, that open avenues for deeper insights in multi-model AI systems.
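The core idea in the abstract (encode activations from one model into a shared sparse code, then decode that code into every model's activation space) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-model linear encoder and decoder, the TopK sparsity rule, the mean-squared-error objective, and all names (`init_usae`, `encode`, `decode`, `universal_loss`, `model_a`, `model_b`) are hypothetical choices made for the example, and the random arrays stand in for real activations computed on the same inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_usae(model_dims, n_concepts):
    """Assumed layout: one linear encoder/decoder per model, one shared concept space."""
    params = {}
    for name, d in model_dims.items():
        params[name] = {
            "W_enc": rng.normal(0.0, d ** -0.5, size=(d, n_concepts)),
            "b_enc": np.zeros(n_concepts),
            "W_dec": rng.normal(0.0, n_concepts ** -0.5, size=(n_concepts, d)),
        }
    return params

def encode(params, source, acts, k):
    """Map activations of model `source` to a sparse, non-negative concept code."""
    p = params[source]
    pre = np.maximum(acts @ p["W_enc"] + p["b_enc"], 0.0)  # ReLU
    # Keep the k largest entries per row and zero the rest (TopK sparsity, assumed).
    drop = np.argpartition(pre, -k, axis=1)[:, :-k]
    codes = pre.copy()
    np.put_along_axis(codes, drop, 0.0, axis=1)
    return codes

def decode(params, target, codes):
    """Reconstruct the activations of model `target` from the shared code."""
    return codes @ params[target]["W_dec"]

def universal_loss(params, batch, source, k):
    """Encode from one model, then sum reconstruction errors over every model."""
    codes = encode(params, source, batch[source], k)
    total = 0.0
    for target, acts in batch.items():
        total += np.mean((decode(params, target, codes) - acts) ** 2)
    return total, codes

# Stand-in activations for 8 shared inputs from two hypothetical models.
batch = {"model_a": rng.normal(size=(8, 16)), "model_b": rng.normal(size=(8, 24))}
params = init_usae({"model_a": 16, "model_b": 24}, n_concepts=64)  # overcomplete: 64 > 16, 24
loss, codes = universal_loss(params, batch, source="model_a", k=4)
```

A training loop (omitted here) would pick a source model for each batch and update the parameters to reduce `loss`. Because every decoder reads the same code, a given concept index refers to the same direction of variation in all models, which is what makes cross-model comparison possible.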

Problem

Research questions and friction points this paper is trying to address.

Interpretable concept alignment across models

Universal concept space for multiple networks

Cross-model analysis using sparse autoencoders

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Sparse Autoencoders for cross-model alignment

Shared objective optimizes universal concept space

Interprets activations across diverse models and datasets

🔎 Similar Papers

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models