🤖 AI Summary
This study addresses an oversight in current evaluations of consumer health AI triage systems, which often attribute triage failures solely to model capability while neglecting the influence of evaluation format. In controlled experiments on a 17-scenario partial replication bank, the authors compare five frontier large language models (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, and Gemini 3.1 Pro) under two conditions: constrained, exam-style prompts with forced A/B/C/D output, and naturalistic, patient-style messages. Supported by targeted ablations and prompt-faithful checks, the results show that naturalistic interaction improves triage accuracy by 6.4 percentage points, and that three models score only 0–24% under forced choice but 100% with free text. The work provides empirical evidence that evaluation format is a major source of performance distortion in AI triage assessment, and it argues for protocols that reflect how consumers actually use health chatbots.
📝 Abstract
Ramaswamy et al. reported in \textit{Nature Medicine} that ChatGPT Health under-triages 51.6\% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100\% of trials across all models and conditions. Asthma triage improved from 48\% to 80\%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0--24\% with forced choice but 100\% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
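The forced-choice versus free-text contrast reported above is a comparison of two proportions. The following is a minimal sketch of how such a comparison could be computed, assuming a two-sided Fisher exact test; the abstract does not state which test the authors used, and the function names and trial counts below are hypothetical illustrations, not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions (e.g. free text, forced choice); columns are
    correct / incorrect triage counts. Assumed test, not the authors' method.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x correct trials in row 1,
        # holding all table margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def accuracy_gap_pp(correct_a, n_a, correct_b, n_b):
    """Difference in accuracy between two conditions, in percentage points."""
    return 100.0 * (correct_a / n_a - correct_b / n_b)


# Hypothetical counts for one model on one scenario (NOT from the paper).
free_text_correct, free_text_n = 20, 20
forced_correct, forced_n = 2, 20

gap = accuracy_gap_pp(free_text_correct, free_text_n, forced_correct, forced_n)
p_value = fisher_exact_two_sided(
    free_text_correct, free_text_n - free_text_correct,
    forced_correct, forced_n - forced_correct,
)
print(f"accuracy gap: {gap:.1f} pp, Fisher exact p = {p_value:.2e}")
```

With counts this lopsided the exact test yields a very small p-value even at modest sample sizes, which is why a format effect of this magnitude can be detected within a single model and scenario.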