🤖 AI Summary
Understanding how visual, textual, and multimodal encoders share conceptual representations remains challenging due to modality-specific architectures and lack of concept-level interpretability.
Method: We propose a cross-modal sparse autoencoder (SAE) framework for feature extraction and alignment, introducing “Comparative Sharedness”—a novel metric quantifying shared conceptual representations across encoders at the concept granularity. Our approach integrates SAE-based interpretable feature activation modeling with cross-model similarity measurement, avoiding strong assumptions of direct modality alignment.
Contribution/Results: Evaluated on 21 state-of-the-art encoders—including CLIP, Flamingo, and Qwen-VL—our analysis reveals that multimodal models’ visual representations significantly inherit conceptual structures from pretrained text encoders. This provides the first empirical evidence of deep semantic transfer from text pretraining to visual representation learning. The work establishes a new paradigm for multimodal representation alignment and interpretable AI, grounded in rigorous, concept-level comparative analysis.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural networks activations. Previous works compared different models based on SAE-derived features but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain specific datasets. The results allow to revisit previous studies at the light of encoders trained in a multimodal context and to quantify to which extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts