Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates BiomedCLIP's representational capacity and limitations on IU-Xray, a highly imbalanced, out-of-distribution multi-label radiology image dataset. Across three paradigms (zero-shot inference, full fine-tuning, and linear probing), we identify significant over-prediction bias and insufficient inter-class separability. We propose, reportedly a first in medical vision-language modeling, an explainability validation framework that integrates radiologist annotations with Grad-CAM, enabling quantitative analysis of the model's decision rationale. Results show: (1) suboptimal zero-shot performance; (2) full fine-tuning markedly improves specificity in disease identification; (3) linear probing better captures co-occurring lesion patterns; and (4) model predictions achieve 78.3% F1-score agreement with expert annotations. Collectively, this work establishes a methodological framework for assessing and improving the reliability of medical vision-language models in clinical settings, grounded in empirical evaluation across diverse inference paradigms.

📝 Abstract
In this paper, we pursue two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision-language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-Xray dataset, which exhibits the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full fine-tuning, and linear probing. The results show that under zero-shot settings the model over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate the model's visual understanding using Grad-CAM heatmaps and compare them with 15 annotations by a radiologist. We highlight the need for careful adaptation of the models to foster reliability and applicability in real-world settings. The code for the experiments in this work is available and maintained on GitHub.
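The zero-shot setting described above can be sketched as a CLIP-style similarity search: the image embedding is compared against one text-prompt embedding per label, and every label whose score clears a threshold is predicted. The sketch below uses random stand-in embeddings (the real ones would come from BiomedCLIP's image and text encoders); the label names and threshold are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Hypothetical pre-computed embeddings; in practice these come from
# BiomedCLIP's encoders (e.g. loaded via the open_clip library).
rng = np.random.default_rng(0)
labels = ["cardiomegaly", "effusion", "atelectasis", "no finding"]
text_emb = rng.normal(size=(len(labels), 512))   # one prompt embedding per label
image_emb = rng.normal(size=(512,))              # one chest X-ray embedding

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# CLIP-style scores: cosine similarity between the image and each label prompt.
scores = l2_normalize(text_emb) @ l2_normalize(image_emb)

# Multi-label decision: threshold each label independently rather than
# taking an argmax, since a radiograph can carry several findings at once.
threshold = 0.0
predicted = [lab for lab, s in zip(labels, scores) if s > threshold]
print(predicted)
```

Note how the threshold choice directly controls the over-prediction behaviour the paper reports: a threshold that is too permissive makes the model emit nearly every label, which tanks precision.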
Problem

Research questions and friction points this paper is trying to address.

Analyze BiomedCLIP's embedding space for class separations
Quantify BiomedCLIP's limitations on imbalanced medical datasets
Evaluate model performance under zero-shot, fine-tuning, and linear probing paradigms
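Of the three paradigms listed above, linear probing is the simplest to make concrete: a linear classifier is trained on frozen embeddings while the backbone stays untouched. The sketch below fits a one-vs-rest logistic probe with plain gradient descent on synthetic stand-in embeddings; the dimensions and data are illustrative, not taken from the paper.

```python
import numpy as np

# Toy linear probe: a one-vs-rest logistic classifier on frozen image
# embeddings. In the paper's setting the embeddings would come from
# BiomedCLIP's image encoder; here they are synthetic stand-ins.
rng = np.random.default_rng(1)
n, d, n_labels = 200, 32, 4
X = rng.normal(size=(n, d))                              # frozen embeddings
true_W = rng.normal(size=(d, n_labels))                  # hidden ground truth
Y = (X @ true_W + rng.normal(scale=0.1, size=(n, n_labels)) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Plain gradient descent on the multi-label binary cross-entropy;
# only W is trained, mirroring a frozen-backbone probe.
W = np.zeros((d, n_labels))
lr = 0.5
for _ in range(300):
    P = sigmoid(X @ W)
    W -= lr * X.T @ (P - Y) / n

train_acc = ((sigmoid(X @ W) > 0.5) == Y).mean()
print(f"train accuracy: {train_acc:.3f}")
```

Because each label gets its own independent sigmoid, the probe can fire on several labels at once, which is one plausible reason the paper finds linear probing better at capturing co-occurring lesion patterns.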
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyze BiomedCLIP embedding space for class separations
Evaluate BiomedCLIP on imbalanced multi-label datasets
Use Grad-CAM heatmaps for model visual understanding
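The Grad-CAM step in the bullets above reduces to a small amount of linear algebra once the activations and gradients are in hand. The sketch below runs the computation on hypothetical tensors; in practice `acts` and `grads` would be captured with hooks on the vision backbone (for example via the pytorch-grad-cam library), and the tensor shapes here are illustrative.

```python
import numpy as np

# Hypothetical inputs: `acts` are the activations of the last convolutional
# block for one image, `grads` the gradient of the target class score
# with respect to those activations.
rng = np.random.default_rng(2)
C, H, W = 8, 7, 7
acts = rng.normal(size=(C, H, W))
grads = rng.normal(size=(C, H, W))

# 1) Channel weights: global-average-pool the gradients.
weights = grads.mean(axis=(1, 2))                         # shape (C,)

# 2) Weighted sum of activation maps, then ReLU so only regions that
#    push the class score up survive.
cam = np.maximum((weights[:, None, None] * acts).sum(axis=0), 0.0)

# 3) Normalize to [0, 1] before upsampling and overlaying on the radiograph.
cam = cam / (cam.max() + 1e-8)
print(cam.shape)
```

Comparing such heatmaps against expert-drawn regions is what lets the paper score agreement between the model's rationale and the radiologist's annotations quantitatively.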
Nafiz Sadman
Researcher at BAMLab - Queen's University (Kingston, Ontario, Canada)
Natural Language Processing · Machine Learning · Deep Learning · Data Science · Cybersecurity
F. Zulkernine
School of Computing, Queen’s University, Canada
Benjamin Kwan
Department of Diagnostic Radiology, Kingston Health Sciences Centre, Canada