AI Summary
This study systematically evaluates the fairness and robustness of six CLIP-based architectures for chest X-ray classification. Examining bias across the sensitive attributes of age, sex, and race, we observe substantial performance disparities across age groups, while the models behave relatively equitably with respect to sex and race. All models consistently degrade on pneumothorax samples lacking chest tubes, revealing strong reliance on the spurious "chest tube presence" shortcut feature. Experiments use three large-scale datasets (MIMIC-CXR, NIH-CXR14, and NEATX) and combine embedding-space analysis with multi-task evaluation. Notably, conventional visualization techniques such as PCA prove inadequate for attributing model decisions to sensitive attributes. To our knowledge, this is the first work to quantitatively characterize shortcut learning and subgroup bias in CLIP models applied to medical imaging, providing empirical evidence for developing trustworthy multimodal AI in healthcare.
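As an illustration of the subgroup evaluation summarized above, here is a minimal sketch of computing per-subgroup AUROC and the largest gap between subgroups. The function and variable names are hypothetical, not taken from the paper's codebase; the same stratification applies to the shortcut analysis by grouping pneumothorax cases on chest-drain presence.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, groups):
    """Per-subgroup AUROC and the max-min gap across subgroups.

    y_true  : binary labels for one condition (e.g. pneumothorax)
    y_score : model scores (e.g. CLIP image-text similarities)
    groups  : subgroup id per sample (age bin, sex, race, or
              chest-drain presence for the shortcut analysis)
    """
    aurocs = {}
    for g in np.unique(groups):
        mask = groups == g
        # AUROC is undefined when a subgroup contains a single class
        if len(np.unique(y_true[mask])) < 2:
            continue
        aurocs[g] = roc_auc_score(y_true[mask], y_score[mask])
    gap = max(aurocs.values()) - min(aurocs.values())
    return aurocs, gap
```

A small AUROC gap across sex or race groups alongside a large gap across age bins, or between drain and no-drain pneumothorax cases, is the pattern the study reports.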
Abstract
Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness across different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess their robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable outcomes for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, such patterns are not visible using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
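To make the embedding analysis concrete, the following sketch contrasts a linear probe with a 2D PCA projection on precomputed image embeddings. The file names and array shapes are assumptions for illustration, not the repository's actual interface.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: precomputed CLIP image embeddings (n_samples, dim)
# and one sensitive-attribute label per image (e.g. sex).
emb = np.load("embeddings.npy")
attr = np.load("attributes.npy")

X_tr, X_te, y_tr, y_te = train_test_split(
    emb, attr, test_size=0.2, random_state=0, stratify=attr
)

# A linear probe can often recover the attribute from the embeddings...
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")

# ...even when the first two principal components show no visible
# subgroup separation, which is why PCA plots alone can mislead.
proj = PCA(n_components=2).fit_transform(emb)
```

The point of the contrast is that linear separability in the full embedding space does not imply visible structure in a two-component projection, which is the limitation of PCA-based inspection the abstract highlights.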