MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deterministic vision–language models struggle to capture the many-to-many correspondence and inherent uncertainty between medical images and radiology reports, limiting their reliability in high-stakes clinical settings. To address this, the authors propose MedProbCLIP, the first approach to integrate probabilistic embeddings and the variational information bottleneck into medical vision–language modeling. By leveraging Gaussian embeddings and probabilistic contrastive learning, MedProbCLIP jointly encodes multi-view images and multi-paragraph reports during training, while enabling bidirectional retrieval from only a single image–report pair at inference. The method mitigates overconfidence, improves calibration, and significantly outperforms strong baselines, including CLIP, CXR-CLIP, and PCME++, on MIMIC-CXR across multiple metrics: retrieval accuracy, zero-shot classification, selective retrieval reliability, and robustness to clinically relevant perturbations.
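The probabilistic contrastive matching the summary describes compares Gaussian embeddings rather than point vectors. A standard ingredient in such objectives (e.g. the PCME++ baseline mentioned above) is the closed-form expected squared Euclidean distance between two diagonal Gaussians, converted into a soft match probability. The sketch below is a minimal illustration of that idea, not MedProbCLIP's exact loss; the function names and the sigmoid parameters `a`, `b` are assumptions for illustration.

```python
import numpy as np

def expected_sq_distance(mu1, var1, mu2, var2):
    """Closed-form E||z1 - z2||^2 for independent z1 ~ N(mu1, diag(var1)),
    z2 ~ N(mu2, diag(var2)):

        E||z1 - z2||^2 = ||mu1 - mu2||^2 + sum(var1) + sum(var2)

    Larger embedding variance (uncertainty) inflates the distance.
    """
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1) + np.sum(var2)

def match_probability(mu_img, var_img, mu_txt, var_txt, a=1.0, b=0.0):
    """Soft image-text match probability via a sigmoid of the negative
    expected distance; a and b would be learnable scalars in practice."""
    d = expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
    return 1.0 / (1.0 + np.exp(a * d - b))
```

In a contrastive setup, matched radiograph-report pairs are pushed toward high match probability and mismatched pairs toward low, so uncertain embeddings naturally soften the penalty on ambiguous many-to-many correspondences.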

📝 Abstract
Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
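The variational information bottleneck the abstract credits with mitigating overconfidence typically enters the loss as a KL-divergence regularizer pulling each Gaussian embedding toward a standard-normal prior. For a diagonal Gaussian this KL term has a well-known closed form, sketched below under the common parameterization via log-variance; this is a generic VIB regularizer, not necessarily the paper's exact formulation.

```python
import numpy as np

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions:

        0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )

    Minimizing this keeps embedding distributions close to the prior,
    discouraging collapsed, near-deterministic (overconfident) embeddings.
    """
    mu = np.asarray(mu, float)
    log_var = np.asarray(log_var, float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

The term is zero exactly when the embedding equals the prior and grows as the posterior becomes sharper or more displaced, which is what yields the calibration and risk-coverage benefits the abstract reports.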
Problem

Research questions and friction points this paper is trying to address.

vision-language models
radiograph-report retrieval
probabilistic representation
medical reliability
uncertainty modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

probabilistic vision-language modeling
Gaussian embeddings
contrastive learning
variational information bottleneck
radiograph-report retrieval