🤖 AI Summary
This work identifies context-induced performance disparities across demographic subgroups in vision-language models (VLMs) applied to two medical diagnosis tasks: skin lesion malignancy prediction and pneumothorax detection. Through experiments on real-world medical imaging data, subgroup sensitivity analysis, prompt manipulation, and ablation studies controlling for baseline disease prevalence, we demonstrate that in-context learning (ICL) not only induces reliance on subgroup-specific disease base rates but also introduces systematic, base-rate-independent biases that substantially amplify inter-subgroup performance gaps. We propose, for the first time, a “subgroup-matched label distribution” prompting principle, in which the ICL demonstration examples are sampled to mirror the label distribution of each target subgroup, and we empirically validate its effectiveness in mitigating bias. This study provides actionable, evidence-based prompting guidelines to enhance the fairness and generalizability of VLMs in clinical deployment.
📝 Abstract
Vision-language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts, and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best practices for prompting current VLMs (specifically, examining demographic subgroup performance and matching the base rates of labels to the target distribution, both at a bulk level and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.
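The base-rate matching recommendation can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_matched_demos`, the record fields `subgroup` and `label`, and the per-subgroup target rates are all hypothetical, chosen only to show how demonstrations might be drawn so that each subgroup's positive-label fraction in the prompt matches an assumed target base rate.

```python
import random
from collections import defaultdict


def sample_matched_demos(pool, target_rates, k_per_group, seed=0):
    """Draw ICL demonstrations whose label mix matches a target rate per subgroup.

    pool: list of dicts with (hypothetical) keys "subgroup" and "label" (0 or 1).
    target_rates: dict mapping subgroup -> desired fraction of positive labels.
    k_per_group: number of demonstrations to draw for each subgroup.
    """
    rng = random.Random(seed)

    # Index the candidate pool by (subgroup, label) for stratified sampling.
    by_key = defaultdict(list)
    for example in pool:
        by_key[(example["subgroup"], example["label"])].append(example)

    demos = []
    for group, rate in target_rates.items():
        # Python's round() rounds exact halves to the nearest even integer,
        # so the realized rate can differ slightly from the target.
        n_pos = round(rate * k_per_group)
        n_neg = k_per_group - n_pos
        positives = by_key[(group, 1)]
        negatives = by_key[(group, 0)]
        if len(positives) < n_pos or len(negatives) < n_neg:
            raise ValueError(f"not enough examples for subgroup {group!r}")
        demos += rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

    # Shuffle so the prompt order does not group demonstrations by subgroup or label.
    rng.shuffle(demos)
    return demos


# Toy pool: two subgroups, 20 positive and 20 negative cases each (synthetic).
pool = [
    {"subgroup": g, "label": y, "id": f"{g}-{y}-{i}"}
    for g in ("A", "B")
    for y in (0, 1)
    for i in range(20)
]
demos = sample_matched_demos(pool, {"A": 0.25, "B": 0.5}, k_per_group=8)
```

In this toy setup, subgroup A contributes 2 positive and 6 negative demonstrations and subgroup B contributes 4 of each. The bulk positive rate of the prompt is then determined by the per-subgroup rates and counts, so matching the target distribution at both levels may require adjusting `k_per_group` per subgroup.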