Interpreting the Linear Structure of Vision-language Model Embedding Spaces

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how semantics and modalities are organized within the joint embedding space of vision-language models (VLMs). To address the core question of how images and text are structurally represented in a shared semantic space, the authors introduce a geometric analysis framework based on sparse autoencoders (SAEs), constructing high-fidelity, highly sparse concept dictionaries for four major VLMs (CLIP, SigLIP, SigLIP2, and AIMv2). They propose the Bridge Score, a novel metric quantifying cross-modal conceptual synergy, and find that sparse SAE directions correspond to stable, interpretable semantic bridges, challenging the assumption of purely modality-specific encoding. Experiments demonstrate that these key concepts are highly stable across training runs. To support further research, the authors open-source an interactive platform for concept exploration. The approach offers a new lens on VLM interpretability and semantic disentanglement.

📝 Abstract
Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but the key commonly-activating concepts extracted by SAEs are remarkably stable across runs. Interestingly, while most concepts are strongly unimodal in activation, we find they are not merely encoding modality per se. Many lie close to, but not entirely within, the subspace defining modality, suggesting that they encode cross-modal semantics despite their unimodal usage. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even unimodal concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
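The abstract's core operation, approximating an embedding as a sparse linear combination of learned concept directions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the tied encoder weights, and the fixed negative bias are assumptions chosen for clarity, and the dictionary here is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 512-d embedding space and a 4096-concept dictionary.
d_embed, n_concepts = 512, 4096

# Concept dictionary with unit-norm rows (random stand-in for learned directions).
D = rng.standard_normal((n_concepts, d_embed))
D /= np.linalg.norm(D, axis=1, keepdims=True)

def sae_encode(z, W_enc, b_enc):
    """One-layer SAE encoder: ReLU yields sparse, non-negative concept activations."""
    return np.maximum(0.0, z @ W_enc.T + b_enc)

def sae_decode(a, D):
    """Reconstruct the embedding as a sparse linear combination of concept directions."""
    return a @ D

# Tied encoder weights plus a negative bias (a common way to encourage sparsity).
W_enc, b_enc = D, -0.5

z = rng.standard_normal(d_embed)   # a VLM embedding (random stand-in)
a = sae_encode(z, W_enc, b_enc)    # sparse concept activations
z_hat = sae_decode(a, D)           # approximate reconstruction of z

sparsity = np.mean(a > 0)          # fraction of concepts active on this input
```

In a trained SAE the dictionary and bias are optimized so that `z_hat` closely matches `z` while only a small fraction of concepts activate per input, which is the reconstruction-versus-sparsity trade-off the abstract compares across feature-learning methods.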
Problem

Research questions and friction points this paper is trying to address.

Analyze organization of language and images in joint embedding spaces
Investigate sparse autoencoders' stability and cross-modal semantic encoding
Introduce Bridge Score to quantify cross-modal concept collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders analyze vision-language model embeddings
Bridge Score metric identifies cross-modal concept pairs
Interactive demos released for exploring concept spaces
Isabel Papadimitriou
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Huangyuan Su
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Computer Science, Harvard University
Thomas Fel
Kempner Fellow, Harvard University
Interpretability, Deep Learning, Computer Vision, Vision Interpretability, Neuro Interpretability
Naomi Saphra
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Sham Kakade
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Computer Science, Harvard University
Stephanie Gil
Assistant Professor, Harvard University
Networked robotics, multi-robot control