🤖 AI Summary
Medical vision-language models (MVLMs) exhibit insufficient robustness under realistic clinical noise and artifacts, yet existing evaluations rely predominantly on clean data and lack systematic distortion testing. Method: We introduce MediMeta-C, a medical corruption robustness benchmark that, combined with the existing MedMNIST-C, reveals substantial performance degradation across five imaging modalities. Building on these findings, we propose RobustMedCLIP, which applies low-rank adaptation (LoRA) to the visual encoder together with few-shot tuning to enhance corruption robustness without compromising cross-modal generalization. Contribution/Results: Extensive experiments demonstrate that RobustMedCLIP achieves an average 12.7% improvement in corruption-robust accuracy while preserving performance on clean data, jointly improving robustness and generalization for MVLMs.
📝 Abstract
Medical Vision-Language Models (MVLMs) have achieved strong generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly assess clean datasets, overlooking robustness, i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual-encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
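For concreteness, corruption benchmarks in this lineage (MedMNIST-C follows the ImageNet-C recipe) expand a clean test split into severity-graded corrupted variants. Below is a minimal sketch of that idea; the Gaussian-noise function, sigma schedule, and helper names are illustrative assumptions, not MediMeta-C's actual corruption set or parameters.

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 1) -> np.ndarray:
    """Severity-graded Gaussian noise in the ImageNet-C style.

    `image` is an HxWxC float array in [0, 1]. The sigma schedule is
    an illustrative assumption, not the benchmark's actual values.
    """
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    noisy = image + np.random.normal(scale=sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def build_corrupted_split(images, corruptions, severities=range(1, 6)):
    """Expand a clean test split into one variant per (corruption, severity)."""
    return {
        (fn.__name__, s): [fn(img, severity=s) for img in images]
        for fn in corruptions
        for s in severities
    }
```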
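The adaptation recipe can likewise be pictured as low-rank adapters injected into the visual encoder's attention projections, with only those adapters updated during few-shot tuning. Here is a minimal sketch using Hugging Face `transformers` and `peft`; the base checkpoint, rank, and target modules are assumptions, not RobustMedCLIP's reported configuration.

```python
import torch
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# A generic CLIP stands in for the pretrained MVLM (the checkpoint
# choice is an assumption; the paper adapts a medical CLIP variant).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Inject LoRA adapters into the visual encoder's attention projections
# only; rank, alpha, and dropout are illustrative values.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=r"vision_model\..*\.(q_proj|v_proj)",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapters are trainable

# Few-shot tuning updates just the LoRA weights, so a handful of
# in-domain examples can improve corruption robustness without
# overwriting the pretrained cross-modal alignment.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Freezing the base weights is what makes the few-shot regime viable: the small adapter subspace limits overfitting to the handful of tuning examples while leaving the model's cross-modal generalization intact.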