Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

📅 2026-01-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the susceptibility of large language models (LLMs) to clinically non-decisive cultural cues in medical question answering, which can introduce diagnostic bias and compromise clinical accuracy and fairness. The authors present the first clinically validated counterfactual evaluation framework, systematically injecting cultural identity markers and contextual cues into MedQA items to quantify their individual and combined effects on model performance. Using counterfactual data augmentation, LLM-as-judge automated scoring (κ=0.76), multi-model comparisons (e.g., GPT-5.2, Llama-3.1-8B), and dual-prompt strategies covering both answer choices and explanations, the research demonstrates that cultural cues significantly reduce accuracy (p<10⁻¹⁴), with the largest declines (3–7 percentage points) occurring when both cue types co-occur. Moreover, more than half of the culturally influenced explanations lead to erroneous diagnoses.
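The significance result above (Cochran's Q, p<10⁻¹⁴) comes from a test for k related binary samples: the same items scored correct/incorrect under each cue condition. A minimal sketch of the statistic, assuming a 0/1 correctness matrix with one row per item and one column per condition; the function name and sample data are illustrative, not the authors' code:

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(x):
    """Cochran's Q test for k related binary samples.

    x: (n_items, k_conditions) array of 0/1 correctness indicators,
       where each row is one question scored under every condition.
    Returns the Q statistic and its chi-squared p-value (df = k - 1).
    """
    x = np.asarray(x)
    n, k = x.shape
    col = x.sum(axis=0)   # successes per condition
    row = x.sum(axis=1)   # successes per item
    N = x.sum()           # total successes
    q = (k - 1) * (k * (col ** 2).sum() - N ** 2) / (k * N - (row ** 2).sum())
    p = chi2.sf(q, df=k - 1)
    return q, p

# Toy example: 4 items scored under 3 conditions.
correct = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 1],
                    [1, 1, 0]])
q, p = cochrans_q(correct)
```

Because every item appears under every condition (the same stem with different injected cues), a repeated-measures test like Cochran's Q is the natural choice over an unpaired comparison of per-condition accuracies.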

๐Ÿ“ Abstract
Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control; a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3–7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
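The 150 → 1650 expansion implies 11 variants per item, which the arithmetic suggests is: the unmodified original, a length-matched neutral control, and three cue conditions (identifier, context, both) for each of the three groups. A hypothetical sketch of such an expansion; the cue templates and function names below are placeholders, not the authors' actual wording:

```python
# Sketch of the counterfactual expansion described in the abstract.
# Group names follow the paper; the injected text is illustrative only.

GROUPS = ["Indigenous Canadian", "Middle-Eastern Muslim", "Southeast Asian"]
CONDITIONS = ["identifier", "context", "identifier+context"]

def neutral_edit(stem: str) -> str:
    # Placeholder for a length-matched, culture-free edit.
    return stem + " The patient arrived at the clinic by car."

def inject(stem: str, group: str, condition: str) -> str:
    # Placeholder for the paper's cue-injection templates.
    return f"[{condition} cue for {group}] " + stem

def expand(stem: str) -> list[dict]:
    """Return all 11 benchmark variants for one MedQA stem."""
    variants = [
        {"group": None, "condition": "original", "text": stem},
        {"group": None, "condition": "neutral", "text": neutral_edit(stem)},
    ]
    for group in GROUPS:
        for condition in CONDITIONS:
            variants.append({
                "group": group,
                "condition": condition,
                "text": inject(stem, group, condition),
            })
    return variants

# 150 items x 11 variants each = 1650 benchmark items.
variants = expand("A 45-year-old presents with acute chest pain.")
```

Keeping the gold answer fixed across all 11 variants is what makes the design counterfactual: any accuracy drop on a cue-bearing variant is attributable to the cue itself, not to a change in the clinically correct answer.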
Problem

Research questions and friction points this paper is trying to address.

counterfactual
cultural cues
medical QA
diagnostic accuracy
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual benchmark
cultural bias
medical QA
LLM-as-judge
diagnostic stability
Amirhossein Haji Mohammad Rezaei
Institute of Health Policy, Management, and Evaluation (IHPME), Dalla Lana School of Public Health, University of Toronto, Canada
Zahra Shakeri
University of Toronto
Health Informatics · AI for Healthcare · Information Visualization · Digital Health