🤖 AI Summary
This study systematically evaluates BiomedCLIP's representational capacity and limitations on IU-Xray, a highly imbalanced, out-of-distribution multi-label radiology image dataset. Across three inference paradigms (zero-shot inference, full fine-tuning, and linear probing), we identify a significant over-prediction bias and insufficient inter-class separability. We also propose, for the first time in medical vision-language modeling, an explainability validation framework that integrates radiologist annotations with Grad-CAM, enabling quantitative analysis of the model's decision rationale. Results show that: (1) zero-shot performance is suboptimal; (2) full fine-tuning markedly improves specificity in disease identification; (3) linear probing better captures co-occurring lesion patterns; and (4) model predictions achieve 78.3% F1-score agreement with expert annotations. Collectively, this work establishes a novel methodological framework for assessing and improving the reliability of medical vision-language models in clinical settings, grounded in empirical evaluation across diverse inference paradigms.
📝 Abstract
In this paper, we pursue two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision-language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-Xray dataset, which exhibits both of these characteristics, and evaluate BiomedCLIP at classifying images (radiographs) in three settings: zero-shot inference, full fine-tuning, and linear probing. The results show that under zero-shot settings the model over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing better detects overlapping features. We examine the model's visual grounding using Grad-CAM heatmaps and compare them against 15 annotations provided by a radiologist. We highlight the need for careful adaptation of such models to foster reliability and applicability in real-world settings. The code for the experiments in this work is available and maintained on GitHub.
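To make the linear-probing setting concrete, the sketch below trains independent per-label classifiers on top of frozen image embeddings, which is one common way to probe a multi-label dataset. It is a minimal illustration only: the 512-dimensional random vectors stand in for precomputed BiomedCLIP embeddings, and the label names are placeholders, not the actual IU-Xray label set or the authors' pipeline.

```python
# Minimal sketch of linear probing for multi-label classification.
# Random vectors stand in for embeddings from a frozen image encoder;
# the label names below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
LABELS = ["Cardiomegaly", "Effusion", "Atelectasis"]  # placeholder classes

# Stand-ins for embeddings extracted once from a frozen encoder.
X_train = rng.normal(size=(200, 512))
Y_train = rng.integers(0, 2, size=(200, len(LABELS)))  # binary multi-label targets
X_test = rng.normal(size=(20, 512))

# One independent binary probe per label, so co-occurring findings
# can all be predicted positive for the same image.
probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

probs = probe.predict_proba(X_test)   # per-label probabilities, shape (20, 3)
preds = (probs >= 0.5).astype(int)    # independent threshold per label
```

Because each label gets its own decision boundary over the shared embedding space, this setup can flag several overlapping findings per radiograph, whereas a single softmax head would force them to compete.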