🤖 AI Summary
This study addresses the susceptibility of large language models (LLMs) to clinically non-decisive cultural cues in medical question answering, which can introduce diagnostic bias and compromise clinical accuracy and fairness. The authors present the first clinically validated counterfactual evaluation framework, systematically injecting cultural identity markers and contextual cues into MedQA items to quantify their individual and combined effects on model performance. Leveraging counterfactual data augmentation, automated LLM-as-judge scoring (κ=0.76), multi-model comparisons (e.g., GPT-5.2, Llama-3.1-8B), and dual prompting strategies targeting both answer choices and explanations, the research demonstrates that cultural cues significantly reduce accuracy (p<10⁻¹⁴), with the largest declines (3-7 percentage points) occurring when both cue types co-occur. Moreover, over half of the culturally influenced explanations lead to erroneous diagnoses.
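The judge-agreement figure cited above (κ=0.76) is Cohen's kappa, chance-corrected agreement between the LLM judge and human raters. A minimal pure-Python sketch of the statistic, using hypothetical labels rather than the paper's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judge vs. human verdicts over six explanations
human = ["ok", "ok", "bad", "bad", "ok", "bad"]
judge = ["ok", "bad", "bad", "bad", "ok", "bad"]
print(cohens_kappa(human, judge))  # 2/3 for this hypothetical pair
```

κ discounts the agreement two raters would reach by chance alone, which is why it is the standard check that an automated judge tracks human graders rather than the label distribution.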
📝 Abstract
Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control; a clinician verified that the gold answer remains invariant across all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
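The Cochran's Q test cited above compares paired correct/incorrect outcomes for the same items across cue conditions. A minimal pure-Python sketch of the Q statistic on hypothetical 0/1 accuracy data (not the paper's code); the statistic is referred to a chi-square distribution with k-1 degrees of freedom:

```python
def cochrans_q(outcomes):
    """Cochran's Q for an n-items x k-conditions matrix of 0/1 outcomes."""
    k = len(outcomes[0])
    col = [sum(row[j] for row in outcomes) for j in range(k)]  # successes per condition
    row_tot = [sum(row) for row in outcomes]                   # successes per item
    n_total = sum(row_tot)
    num = (k - 1) * (k * sum(c * c for c in col) - n_total ** 2)
    den = k * n_total - sum(r * r for r in row_tot)
    return num / den

# Hypothetical: 4 items scored under 3 conditions
# (neutral, identifier, identifier + context)
x = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [1, 0, 0]]
print(cochrans_q(x))  # 28/6, compared against chi-square with k-1 = 2 df
```

Because every variant of an item shares its gold answer, the paired design isolates the effect of the injected cue from item difficulty, which is what makes Cochran's Q the appropriate test here.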