🤖 AI Summary
This work examines how misleading contextual cues can push large language models (LLMs) away from correct clinical judgment in medical question answering. It introduces the concept of “epistemic resilience”, the ability to maintain correct judgment under adversarial context, and presents MedMisBench, a benchmark of adversarially crafted misleading contexts (including authority-framed falsehoods and exception-poisoning claims) that evaluates robustness across medical reasoning, agentic capability, and patient-journey evaluation. A clinical expert panel reviews cases for potential harm. Across 11 model configurations, mean accuracy drops from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success overall and up to 69.5% for authority-framed falsehoods. The panel identified serious potential harm in 38.2% of reviewed cases, exposing a critical blind spot in current evaluation frameworks.
📝 Abstract
Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
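To make the two headline metrics concrete, here is a minimal Python sketch of how accuracy and attack success might be computed over (question, misleading context) pairs. The abstract does not give the exact definitions, so the `Trial` record, its field names, and the rule that an attack "succeeds" when an originally correct answer switches to the targeted wrong option are all assumptions for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One (question, misleading context) pair; all field names are hypothetical."""
    gold: str             # correct option label
    target: str           # wrong option the misleading context pushes toward
    clean_answer: str     # model answer on the original question
    attacked_answer: str  # model answer with the misleading context injected


def accuracy(trials, attacked=False):
    """Fraction of trials answered correctly, with or without the injected context."""
    if not trials:
        return 0.0
    hits = sum(
        (t.attacked_answer if attacked else t.clean_answer) == t.gold
        for t in trials
    )
    return hits / len(trials)


def attack_success_rate(trials):
    """Assumed definition: among trials answered correctly without the context,
    the fraction that switch to the targeted wrong option once it is injected."""
    eligible = [t for t in trials if t.clean_answer == t.gold]
    if not eligible:
        return 0.0
    flipped = sum(t.attacked_answer == t.target for t in eligible)
    return flipped / len(eligible)


# Toy data, invented purely to exercise the functions.
trials = [
    Trial(gold="A", target="B", clean_answer="A", attacked_answer="B"),  # flipped
    Trial(gold="A", target="C", clean_answer="A", attacked_answer="A"),  # resisted
    Trial(gold="B", target="D", clean_answer="B", attacked_answer="D"),  # flipped
    Trial(gold="C", target="A", clean_answer="D", attacked_answer="D"),  # not eligible
]

print(accuracy(trials))                 # clean accuracy
print(accuracy(trials, attacked=True))  # accuracy under misleading context
print(attack_success_rate(trials))      # attack success among eligible trials
```

Restricting the denominator to originally correct answers mirrors the abstract's framing of questions "that LLMs originally answer correctly"; a benchmark could equally average per context type or per model, which would give different figures.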