Closing the Modality Gap Aligns Group-Wise Semantics

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work examines the modality gap in multimodal learning: a structural mismatch that leaves the shared latent space only partially shared and degrades performance on group-level semantic tasks such as clustering. The authors propose a method for the two-modality setting, with a straightforward extension to n modalities, that refines CLIP-style alignment with a new loss function designed to reduce this inter-modal discrepancy. In their experiments, reducing the gap significantly improves group-level tasks while yielding only marginal or inconsistent improvements on conventional instance-level tasks such as retrieval. The findings highlight the role of the modality gap in semantic grouping and challenge the view that the gap is inconsequential.

📝 Abstract
In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
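
To make the two notions in the abstract concrete, here is a minimal sketch of a standard CLIP-style symmetric contrastive loss together with one common way of measuring the modality gap (the distance between the centroids of the two modalities). This is not the method proposed in the paper, whose loss is not specified here. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as CLIP does before comparing embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over matched (image_i, text_i) pairs.

    temperature=0.07 is an illustrative assumption, not a value from the paper.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    logits = [[dot(a, b) / temperature for b in txt] for a in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def centroid_gap(image_emb, text_emb):
    """Euclidean distance between the modality centroids.

    One common gap measure, assumed here; the paper may use a different one.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    dim = len(img[0])
    c_img = [sum(v[d] for v in img) / len(img) for d in range(dim)]
    c_txt = [sum(v[d] for v in txt) / len(txt) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))

# Toy data (invented): the third coordinate separates the two modalities,
# yet each image is still more similar to its own caption than to the other.
images = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
texts = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]

print("loss with gap:   ", clip_loss(images, texts))
print("loss without gap:", clip_loss(images, images))
print("gap with offset: ", centroid_gap(images, texts))
print("gap without:     ", centroid_gap(images, images))
```

In this toy configuration the contrastive loss is the same whether or not the modalities are offset from each other, because it depends only on how much larger the matched similarity is than the mismatched one. The centroid distance, by contrast, goes from 0 to roughly 1.41. This is only a hand-built illustration of why instance-wise objectives can be insensitive to the gap, not evidence for the paper's experimental claims.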

Problem

Research questions and friction points this paper is trying to address.

modality gap
multimodal learning
group-wise tasks
semantic alignment
latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap
multimodal learning
group-wise semantics
CLIP
semantic alignment