Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

📅 2026-01-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the susceptibility of large language models (LLMs) to clinically non-decisive cultural cues in medical question answering, which can introduce diagnostic bias and compromise clinical accuracy and fairness. The authors present the first clinically validated counterfactual evaluation framework, systematically injecting cultural identity markers and contextual cues into MedQA items to quantify their individual and combined effects on model performance. Using counterfactual data augmentation, LLM-as-judge automated scoring (κ=0.76), multi-model comparisons (e.g., GPT-5.2, Llama-3.1-8B), and dual-prompt strategies covering both answer choices and explanations, the research demonstrates that cultural cues significantly reduce accuracy (p<10⁻¹⁴), with the largest declines (3–7 percentage points) occurring when both cue types co-occur. Moreover, more than half of the culturally influenced explanations lead to erroneous diagnoses.
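The significance result above (Cochran's Q, p<10⁻¹⁴) comes from a test for k related binary samples: the same items scored correct/incorrect under each cue condition. A minimal sketch of the statistic, assuming a 0/1 correctness matrix with one row per item and one column per condition; the function name and sample data are illustrative, not the authors' code:

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(x):
    """Cochran's Q test for k related binary samples.

    x: (n_items, k_conditions) array of 0/1 correctness indicators,
       where each row is one question scored under every condition.
    Returns the Q statistic and its chi-squared p-value (df = k - 1).
    """
    x = np.asarray(x)
    n, k = x.shape
    col = x.sum(axis=0)   # successes per condition
    row = x.sum(axis=1)   # successes per item
    N = x.sum()           # total successes
    q = (k - 1) * (k * (col ** 2).sum() - N ** 2) / (k * N - (row ** 2).sum())
    p = chi2.sf(q, df=k - 1)
    return q, p

# Toy example: 4 items scored under 3 conditions.
correct = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 1],
                    [1, 1, 0]])
q, p = cochrans_q(correct)
```

Because every item appears under every condition (the same stem with different injected cues), a repeated-measures test like Cochran's Q is the natural choice over an unpaired comparison of per-condition accuracies.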

๐Ÿ“ Abstract
Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control; a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3–7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
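The 150 → 1650 expansion implies 11 variants per item, which the arithmetic suggests is: the unmodified original, a length-matched neutral control, and three cue conditions (identifier, context, both) for each of the three groups. A hypothetical sketch of such an expansion; the cue templates and function names below are placeholders, not the authors' actual wording:

```python
# Sketch of the counterfactual expansion described in the abstract.
# Group names follow the paper; the injected text is illustrative only.

GROUPS = ["Indigenous Canadian", "Middle-Eastern Muslim", "Southeast Asian"]
CONDITIONS = ["identifier", "context", "identifier+context"]

def neutral_edit(stem: str) -> str:
    # Placeholder for a length-matched, culture-free edit.
    return stem + " The patient arrived at the clinic by car."

def inject(stem: str, group: str, condition: str) -> str:
    # Placeholder for the paper's cue-injection templates.
    return f"[{condition} cue for {group}] " + stem

def expand(stem: str) -> list[dict]:
    """Return all 11 benchmark variants for one MedQA stem."""
    variants = [
        {"group": None, "condition": "original", "text": stem},
        {"group": None, "condition": "neutral", "text": neutral_edit(stem)},
    ]
    for group in GROUPS:
        for condition in CONDITIONS:
            variants.append({
                "group": group,
                "condition": condition,
                "text": inject(stem, group, condition),
            })
    return variants

# 150 items x 11 variants each = 1650 benchmark items.
variants = expand("A 45-year-old presents with acute chest pain.")
```

Keeping the gold answer fixed across all 11 variants is what makes the design counterfactual: any accuracy drop on a cue-bearing variant is attributable to the cue itself, not to a change in the clinically correct answer.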
Problem

Research questions and friction points this paper is trying to address.

counterfactual
cultural cues
medical QA
diagnostic accuracy
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual benchmark
cultural bias
medical QA
LLM-as-judge
diagnostic stability
Amirhossein Haji Mohammad Rezaei
Institute of Health Policy, Management, and Evaluation (IHPME), Dalla Lana School of Public Health, University of Toronto, Canada
Zahra Shakeri
University of Toronto
Health Informatics · AI for Healthcare · Information Visualization · Digital Health