MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deterministic vision–language models struggle to capture the many-to-many correspondence and inherent uncertainty between medical images and radiology reports, limiting their reliability in high-stakes clinical settings. To address this, the authors propose MedProbCLIP, the first approach to integrate probabilistic embeddings and the variational information bottleneck into medical vision–language modeling. By leveraging Gaussian embeddings and probabilistic contrastive learning, MedProbCLIP jointly encodes multi-view images and multi-paragraph reports during training, while enabling bidirectional retrieval from only a single image–report pair at inference. The method mitigates overconfidence, improves calibration, and significantly outperforms strong baselines, including CLIP, CXR-CLIP, and PCME++, on MIMIC-CXR across multiple metrics: retrieval accuracy, zero-shot classification, selective retrieval reliability, and robustness to clinically relevant perturbations.
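The probabilistic contrastive matching the summary describes compares Gaussian embeddings rather than point vectors. A standard ingredient in such objectives (e.g. the PCME++ baseline mentioned above) is the closed-form expected squared Euclidean distance between two diagonal Gaussians, converted into a soft match probability. The sketch below is a minimal illustration of that idea, not MedProbCLIP's exact loss; the function names and the sigmoid parameters `a`, `b` are assumptions for illustration.

```python
import numpy as np

def expected_sq_distance(mu1, var1, mu2, var2):
    """Closed-form E||z1 - z2||^2 for independent z1 ~ N(mu1, diag(var1)),
    z2 ~ N(mu2, diag(var2)):

        E||z1 - z2||^2 = ||mu1 - mu2||^2 + sum(var1) + sum(var2)

    Larger embedding variance (uncertainty) inflates the distance.
    """
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1) + np.sum(var2)

def match_probability(mu_img, var_img, mu_txt, var_txt, a=1.0, b=0.0):
    """Soft image-text match probability via a sigmoid of the negative
    expected distance; a and b would be learnable scalars in practice."""
    d = expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
    return 1.0 / (1.0 + np.exp(a * d - b))
```

In a contrastive setup, matched radiograph-report pairs are pushed toward high match probability and mismatched pairs toward low, so uncertain embeddings naturally soften the penalty on ambiguous many-to-many correspondences.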

📝 Abstract
Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
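The variational information bottleneck the abstract credits with mitigating overconfidence typically enters the loss as a KL-divergence regularizer pulling each Gaussian embedding toward a standard-normal prior. For a diagonal Gaussian this KL term has a well-known closed form, sketched below under the common parameterization via log-variance; this is a generic VIB regularizer, not necessarily the paper's exact formulation.

```python
import numpy as np

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions:

        0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )

    Minimizing this keeps embedding distributions close to the prior,
    discouraging collapsed, near-deterministic (overconfident) embeddings.
    """
    mu = np.asarray(mu, float)
    log_var = np.asarray(log_var, float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

The term is zero exactly when the embedding equals the prior and grows as the posterior becomes sharper or more displaced, which is what yields the calibration and risk-coverage benefits the abstract reports.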
Problem

Research questions and friction points this paper is trying to address.

vision-language models
radiograph-report retrieval
probabilistic representation
medical reliability
uncertainty modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

probabilistic vision-language modeling
Gaussian embeddings
contrastive learning
variational information bottleneck
radiograph-report retrieval