🤖 AI Summary
This study systematically evaluates BiomedCLIP's representational capacity and limitations on IU-Xray, a highly imbalanced, out-of-distribution multi-label radiology image dataset. Across three inference paradigms (zero-shot inference, full fine-tuning, and linear probing), we identify a significant over-prediction bias and insufficient inter-class separability. We also propose, for the first time in medical vision-language modeling, an explainability validation framework that integrates radiologist annotations with Grad-CAM, enabling quantitative analysis of the model's decision rationale. Results show that: (1) zero-shot performance is suboptimal; (2) full fine-tuning markedly improves specificity in disease identification; (3) linear probing better captures co-occurring lesion patterns; and (4) model predictions achieve 78.3% F1-score agreement with expert annotations. Collectively, this work establishes a novel methodological framework for assessing and improving the reliability of medical vision-language models in clinical settings, grounded in empirical evaluation across diverse inference paradigms.
📝 Abstract
In this paper, we pursue two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision-language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-Xray dataset, which exhibits both of these characteristics, and evaluate BiomedCLIP at classifying images (radiographs) in three settings: zero-shot inference, full fine-tuning, and linear probing. The results show that under zero-shot settings the model over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing better detects overlapping features. We examine the model's visual grounding using Grad-CAM heatmaps and compare them against 15 annotations provided by a radiologist. We highlight the need for careful adaptation of such models to foster reliability and applicability in real-world settings. The code for the experiments in this work is available and maintained on GitHub.
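To make the linear-probing setting concrete, the sketch below trains independent per-label classifiers on top of frozen image embeddings, which is one common way to probe a multi-label dataset. It is a minimal illustration only: the 512-dimensional random vectors stand in for precomputed BiomedCLIP embeddings, and the label names are placeholders, not the actual IU-Xray label set or the authors' pipeline.

```python
# Minimal sketch of linear probing for multi-label classification.
# Random vectors stand in for embeddings from a frozen image encoder;
# the label names below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
LABELS = ["Cardiomegaly", "Effusion", "Atelectasis"]  # placeholder classes

# Stand-ins for embeddings extracted once from a frozen encoder.
X_train = rng.normal(size=(200, 512))
Y_train = rng.integers(0, 2, size=(200, len(LABELS)))  # binary multi-label targets
X_test = rng.normal(size=(20, 512))

# One independent binary probe per label, so co-occurring findings
# can all be predicted positive for the same image.
probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

probs = probe.predict_proba(X_test)   # per-label probabilities, shape (20, 3)
preds = (probs >= 0.5).astype(int)    # independent threshold per label
```

Because each label gets its own decision boundary over the shared embedding space, this setup can flag several overlapping findings per radiograph, whereas a single softmax head would force them to compete.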