Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how semantic alignment arises between vision-only and language-only deep models trained without any cross-modal supervision. To this end, the authors combine layer-wise representation analysis, cross-modal similarity modeling, a Pick-a-Pic forced-choice evaluation, and multi-caption/multi-image matching assessments. Results show that semantic alignment peaks at middle-to-late network layers, where representations are sensitive to semantic content yet robust to variations in visual appearance. Moreover, averaging representations across multiple exemplars of a concept strengthens alignment, surpassing the conventional one-to-one pairing paradigm and better reflecting human fine-grained preferences in many-to-many image-text scenarios. Key contributions include: (1) empirical evidence that unimodal models encode a shared semantic structure consistent with human judgments; and (2) the finding that aggregating multiple exemplars improves alignment quality rather than blurring semantic detail.
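The summary does not specify which alignment metric the paper uses; as an illustration only, here is a minimal NumPy sketch of one common choice, linear CKA, scored layer by layer. The random arrays, the 12-layer count, and the feature dimensions are all stand-ins for real cached model activations:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_items, dim)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Stand-in activations: in real use these would be per-layer features for the
# same 256 image-caption pairs from a vision model and a language model.
rng = np.random.default_rng(0)
n_items = 256
vision_layers = [rng.normal(size=(n_items, 768)) for _ in range(12)]  # 12 layers (assumed)
text_feats = rng.normal(size=(n_items, 512))  # one language-model layer

scores = [linear_cka(v, text_feats) for v in vision_layers]
peak = int(np.argmax(scores))
print(f"alignment peaks at vision layer {peak} (CKA = {scores[peak]:.3f})")
```

Swapping in features extracted from an actual vision backbone and language model, layer by layer, would reproduce the kind of layer-wise alignment profile the summary describes.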

📝 Abstract
Recent studies show that deep vision-only and language-only models, trained on disjoint modalities, nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice "Pick-a-Pic" task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
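As a concrete reading of the forced-choice "Pick-a-Pic" protocol, the sketch below scores how often embedding similarity agrees with human picks between two candidate images for one caption. It assumes both modalities have already been projected into a common space (e.g., via a linear map fit on held-out pairs); the function names and toy data are hypothetical, not the paper's exact pipeline:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def forced_choice_accuracy(captions, images_a, images_b, human_picks):
    """Fraction of trials where embedding similarity agrees with the human's
    pick (0 = image A, 1 = image B) given one caption and two candidates."""
    correct = 0
    for c, a, b, pick in zip(captions, images_a, images_b, human_picks):
        model_pick = 0 if cosine(c, a) >= cosine(c, b) else 1
        correct += int(model_pick == pick)
    return correct / len(human_picks)

# Toy data: the matched image stays close to its caption, the distractor doesn't.
rng = np.random.default_rng(1)
caps = rng.normal(size=(100, 64))
imgs_a = caps + 0.3 * rng.normal(size=(100, 64))  # preferred, semantically matched
imgs_b = rng.normal(size=(100, 64))               # distractor
picks = np.zeros(100, dtype=int)                  # humans pick image A in every trial

print(f"agreement with human picks: {forced_choice_accuracy(caps, imgs_a, imgs_b, picks):.2f}")
```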
Problem

Research questions and friction points this paper is trying to address.

Investigating where alignment emerges in vision and language networks
Examining how semantic changes affect cross-modal representational alignment (see the sketch after this list)
Testing whether models capture human preferences in image-text matching
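One way to probe the semantic-change question without assuming any shared embedding space is second-order (RSA-style) alignment: correlate the two modalities' pairwise-similarity structures, and watch the correlation collapse when caption semantics are destroyed. Below is a minimal sketch with stand-in features; scrambled captions are simply modeled as embeddings that have lost their link to the images, whereas in practice one would re-embed the scrambled text:

```python
import numpy as np

def rsa_alignment(feats_a, feats_b):
    """Second-order alignment: correlate the upper triangles of the two
    modalities' item-by-item cosine-similarity matrices."""
    def upper_tri(F):
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
        S = F @ F.T
        return S[np.triu_indices_from(S, k=1)]
    return np.corrcoef(upper_tri(feats_a), upper_tri(feats_b))[0, 1]

rng = np.random.default_rng(2)
img = rng.normal(size=(128, 256))             # stand-in image features
intact = img @ rng.normal(size=(256, 300)) \
         + 0.5 * rng.normal(size=(128, 300))  # captions tracking the images
scrambled = rng.normal(size=(128, 300))       # semantics destroyed

print(f"intact captions:    {rsa_alignment(img, intact):.3f}")
print(f"scrambled captions: {rsa_alignment(img, scrambled):.3f}")
```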
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mid-to-late layers achieve peak representational alignment
Shared semantic code is robust to appearance changes
Embedding averaging amplifies alignment across exemplars (sketched below)
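To make the averaging effect concrete, here is a toy sketch in which each concept's image and caption embeddings are noisy views of one shared direction (a simplifying assumption made for brevity, not the paper's setup). Averaging k exemplars per side denoises the prototypes, so the matched-pair cosine rises with k, mirroring the finding that aggregation amplifies rather than blurs alignment:

```python
import numpy as np

# Toy model: each concept has one true direction; every image or caption
# embedding is that direction plus exemplar-specific noise.
rng = np.random.default_rng(3)
n_concepts, dim, noise = 200, 128, 2.0
concepts = rng.normal(size=(n_concepts, dim))

def mean_matched_cosine(k):
    """Average k exemplar embeddings per concept on each side, then take the
    mean cosine between matched image and caption prototypes."""
    img = (concepts[:, None] + noise * rng.normal(size=(n_concepts, k, dim))).mean(axis=1)
    txt = (concepts[:, None] + noise * rng.normal(size=(n_concepts, k, dim))).mean(axis=1)
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    return float((img * txt).sum(axis=1).mean())

for k in (1, 5, 25):
    print(f"k={k:>2} exemplars averaged -> matched-pair cosine {mean_matched_cosine(k):.3f}")
```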