Multimodal Representation Alignment for Cross-modal Information Retrieval

πŸ“… 2025-06-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Cross-modal retrieval suffers from semantic misalignment due to geometric inconsistency between image and text representations. This work formulates image–text matching as a metric alignment problem in the embedding space. We systematically demonstrate that: (1) the Wasserstein distance quantitatively characterizes inter-modal distributional discrepancy; (2) cosine similarity exhibits superior robustness over Euclidean distance, KL divergence, and other conventional metrics for alignment; and (3) standard MLPs inadequately capture complex cross-modal interactions. To address these issues, we propose a lightweight neural metric learning framework that integrates vision-language models with unimodal encoders, jointly optimizing four classical metrics and two learnable neural metrics. Our approach achieves significant improvements in Recall@K across standard cross-modal retrieval benchmarks. Moreover, it provides practical, deployment-oriented alignment evaluation criteria and architectural design guidelines.
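The summary mentions two learnable neural metrics alongside four classical ones. As a rough illustration of what such a learnable metric head can look like, the sketch below scores an image–text pair with a tiny two-layer MLP over the concatenated embeddings. All names and shapes here are hypothetical; the weights are randomly initialized, whereas in practice they would be trained (e.g. with a contrastive loss), and the paper's actual architecture may differ.

```python
import numpy as np

def mlp_metric(img_vec, txt_vec, params):
    # Score an image-text pair with a two-layer MLP over the
    # concatenated embeddings (ReLU hidden layer, scalar output).
    x = np.concatenate([img_vec, txt_vec])
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])
    return float(params["w2"] @ h + params["b2"])

rng = np.random.default_rng(0)
dim, hidden = 64, 32
params = {  # randomly initialized here; trained with a contrastive loss in practice
    "W1": rng.normal(0.0, 0.1, (hidden, 2 * dim)),
    "b1": np.zeros(hidden),
    "w2": rng.normal(0.0, 0.1, hidden),
    "b2": 0.0,
}
print(mlp_metric(rng.normal(size=dim), rng.normal(size=dim), params))
```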


πŸ“ Abstract
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms alternative metrics in feature alignment tasks. Furthermore, we observe that conventional architectures such as multilayer perceptrons are insufficient for capturing the complex interactions between image and text representations. Our study offers novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications.
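The abstract proposes the Wasserstein distance as an informative measure of the modality gap between image and text embedding distributions. A minimal sketch of that idea, assuming mock embeddings and a simple per-dimension estimator (for equal-size empirical samples, the 1-D Wasserstein-1 distance reduces to the mean absolute difference of sorted samples); the paper's exact estimator may differ:

```python
import numpy as np

def w1_1d(a, b):
    # For equal-size empirical samples, the 1-D Wasserstein-1 distance
    # is the mean absolute difference between the sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def modality_gap(img_emb, txt_emb):
    # Average the per-dimension W1 distances as a crude proxy for the
    # distributional gap between the two embedding clouds.
    return float(np.mean([w1_1d(img_emb[:, j], txt_emb[:, j])
                          for j in range(img_emb.shape[1])]))

rng = np.random.default_rng(0)
img = rng.normal(0.0, 1.0, size=(400, 32))   # mock image embeddings
txt = rng.normal(0.8, 1.2, size=(400, 32))   # mock text embeddings, shifted
print(modality_gap(img, txt))                # larger for more separated clouds
```

A shift or scale difference between the two clouds inflates the score, which is the qualitative behavior one would want from a modality-gap measure.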
Problem

Research questions and friction points this paper is trying to address.

Align multimodal representations for cross-modal retrieval
Measure modality gap using Wasserstein distance
Improve feature alignment with cosine similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns multimodal representations using similarity metrics
Uses Wasserstein distance to measure modality gap
Demonstrates cosine similarity's superiority in alignment
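The points above can be tied together with a standard text-to-image retrieval loop: score every candidate image by cosine similarity and report Recall@K. This is a generic sketch with mock embeddings (the pairing convention, where text i matches image i, is an assumption), not the paper's evaluation code:

```python
import numpy as np

def recall_at_k(img_emb, txt_emb, k=5):
    # L2-normalize so the dot product equals cosine similarity, then
    # check whether the ground-truth image (index i for text i) appears
    # among the top-k retrieved images.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = txt @ img.T                         # text-to-image similarities
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(len(txt))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
txt = img + 0.1 * rng.normal(size=(100, 64))   # nearly aligned mock pairs
print(recall_at_k(img, txt, k=5))
```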
Fan Xu
Department of Computer Science, University of Luxembourg, 6, avenue de la Fonte, Esch-sur-Alzette, L-4364, Luxembourg
Luis A. Leiva
University of Luxembourg
Human-Computer Interaction · Machine Learning · Computational Interaction · Bio-signal processing