Explaining How Visual, Textual and Multimodal Encoders Share Concepts

📅 2025-07-24

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Understanding how visual, textual, and multimodal encoders share conceptual representations remains challenging due to modality-specific architectures and lack of concept-level interpretability. Method: We propose a cross-modal sparse autoencoder (SAE) framework for feature extraction and alignment, introducing “Comparative Sharedness”—a novel metric quantifying shared conceptual representations across encoders at the concept granularity. Our approach integrates SAE-based interpretable feature activation modeling with cross-model similarity measurement, avoiding strong assumptions of direct modality alignment. Contribution/Results: Evaluated on 21 state-of-the-art encoders—including CLIP, Flamingo, and Qwen-VL—our analysis reveals that multimodal models’ visual representations significantly inherit conceptual structures from pretrained text encoders. This provides the first empirical evidence of deep semantic transfer from text pretraining to visual representation learning. The work establishes a new paradigm for multimodal representation alignment and interpretable AI, grounded in rigorous, concept-level comparative analysis.

Technology Category

Application Category

📝 Abstract

Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural networks activations. Previous works compared different models based on SAE-derived features but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain specific datasets. The results allow to revisit previous studies at the light of encoders trained in a multimodal context and to quantify to which extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts

Problem

Research questions and friction points this paper is trying to address.

Quantify feature sharing across visual, textual, multimodal encoders

Compare models using SAE-derived cross-modal indicators

Analyze impact of text pretraining on shared representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders extract interpretable neural features

Novel indicator enables cross-modality SAE feature comparison

Quantifies shared features between visual and textual encoders

🔎 Similar Papers

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts