Closing the gap in multimodal medical representation alignment

📅 2026-02-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the modality gap between medical images and clinical text in a shared representation space, which weakens semantic alignment and degrades cross-modal retrieval and understanding. The authors propose a modality-agnostic contrastive learning framework that mitigates this discrepancy in medical settings through changes to the geometry of the embedding space and joint modeling strategies. The approach targets the limitations of conventional CLIP-based methods in the medical domain, aligning semantically related representations regardless of their source modality. Reported experiments show improved cross-modal retrieval on radiology images paired with clinical reports and better generated image captions.

📝 Abstract

In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that harm true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unexplored and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study the phenomenon in the medical setting, showing that the modality gap is also present in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are better aligned regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.
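To make the two ideas in the abstract concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss and of one simple way to quantify a modality gap. It is not the authors' framework, which the abstract does not specify. The function names, the temperature of 0.07, and the choice of centroid distance as the gap measure are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric contrastive loss: matched pairs (same index) are pulled
    # together, all other pairs in the batch are pushed apart.
    # temperature=0.07 is an assumed value, not taken from the paper.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def centroid_gap(image_embs, text_embs):
    # Assumed gap measure: Euclidean distance between the mean normalized
    # image embedding and the mean normalized text embedding.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    dim = len(imgs[0])
    c_img = [sum(v[d] for v in imgs) / len(imgs) for d in range(dim)]
    c_txt = [sum(v[d] for v in txts) / len(txts) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))

# Toy 2-D embeddings (hypothetical data, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_shifted = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = clip_style_loss(images, texts_aligned)
loss_swapped = clip_style_loss(images, list(reversed(texts_aligned)))
gap_aligned = centroid_gap(images, texts_aligned)
gap_shifted = centroid_gap(images, texts_shifted)
```

In this toy setup, the loss is lower when matched pairs share an index than when the texts are swapped, and the centroid distance is zero only when both modalities occupy the same region of the space. A non-zero centroid distance is one possible indicator of the gap the abstract describes; the paper may define or measure it differently.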

Problem

Research questions and friction points this paper is trying to address.

modality gap

multimodal alignment

medical representation

CLIP

cross-modal retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap

multimodal alignment

medical representation learning