AI Summary
This study systematically evaluates the fairness and robustness of six CLIP-based architectures for chest X-ray classification. Examining bias across the sensitive attributes of age, sex, and race, we observe substantial performance disparities across age groups, while the models behave relatively equitably with respect to sex and race. All models consistently degrade on pneumothorax samples lacking chest tubes, revealing strong reliance on the spurious "chest tube presence" shortcut feature. Experiments use three large-scale datasets (MIMIC-CXR, NIH-CXR14, and NEATX) and combine embedding-space analysis with multi-task evaluation. Notably, conventional visualization techniques such as PCA prove inadequate for attributing model decisions to sensitive attributes. To our knowledge, this is the first work to quantitatively characterize shortcut learning and subgroup bias in CLIP models applied to medical imaging, providing empirical evidence for developing trustworthy multimodal AI in healthcare.
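As an illustration of the subgroup evaluation summarized above, here is a minimal sketch of computing per-subgroup AUROC and the largest gap between subgroups. The function and variable names are hypothetical, not taken from the paper's codebase; the same stratification applies to the shortcut analysis by grouping pneumothorax cases on chest-drain presence.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, groups):
    """Per-subgroup AUROC and the max-min gap across subgroups.

    y_true  : binary labels for one condition (e.g. pneumothorax)
    y_score : model scores (e.g. CLIP image-text similarities)
    groups  : subgroup id per sample (age bin, sex, race, or
              chest-drain presence for the shortcut analysis)
    """
    aurocs = {}
    for g in np.unique(groups):
        mask = groups == g
        # AUROC is undefined when a subgroup contains a single class
        if len(np.unique(y_true[mask])) < 2:
            continue
        aurocs[g] = roc_auc_score(y_true[mask], y_score[mask])
    gap = max(aurocs.values()) - min(aurocs.values())
    return aurocs, gap
```

A small AUROC gap across sex or race groups alongside a large gap across age bins, or between drain and no-drain pneumothorax cases, is the pattern the study reports.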
Abstract
Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness across different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess their robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable outcomes for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, such patterns are not visible using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
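To make the embedding analysis concrete, the following sketch contrasts a linear probe with a 2D PCA projection on precomputed image embeddings. The file names and array shapes are assumptions for illustration, not the repository's actual interface.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: precomputed CLIP image embeddings (n_samples, dim)
# and one sensitive-attribute label per image (e.g. sex).
emb = np.load("embeddings.npy")
attr = np.load("attributes.npy")

X_tr, X_te, y_tr, y_te = train_test_split(
    emb, attr, test_size=0.2, random_state=0, stratify=attr
)

# A linear probe can often recover the attribute from the embeddings...
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")

# ...even when the first two principal components show no visible
# subgroup separation, which is why PCA plots alone can mislead.
proj = PCA(n_components=2).fit_transform(emb)
```

The point of the contrast is that linear separability in the full embedding space does not imply visible structure in a two-component projection, which is the limitation of PCA-based inspection the abstract highlights.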