🤖 AI Summary
Large language models (LLMs) lack systematic evaluation in real-world clinical decision support, specifically in their ability to recommend appropriate medical calculators (e.g., risk stratification or diagnostic tools) in authentic clinical scenarios. Method: The authors constructed a multiple-choice benchmark covering diverse clinical decision tasks, annotated by clinicians, and compared LLMs against medical trainees. Error analysis included fine-grained attribution of mistakes to clinical comprehension failures and statistical estimation of confidence intervals. Contribution/Results: The best-performing model (OpenAI o1) achieved only 66.0% accuracy, below the human average of 79.5%. Nearly half of all errors (49.3%) stemmed from deficits in clinical comprehension. This work identifies a critical bottleneck in LLMs’ clinical utility and establishes both a methodological foundation and an empirical benchmark for assessing how well LLMs adapt to clinical decision-support tasks.
📝 Abstract
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting the appropriate medical calculator, remains uncertain. We evaluated nine LLMs, including open-source, proprietary, and domain-specific models, on 1,009 multiple-choice question-answer pairs spanning 35 clinical calculators and scenarios such as risk stratification and diagnosis, and compared LLMs to medical trainees on a 100-question subset. On that subset, the highest-performing LLM, OpenAI o1, achieved an answer accuracy of 66.0% (CI: 56.7-75.3%), while two human annotators nominally outperformed all LLMs with an average accuracy of 79.5% (CI: 73.5-85.0%). Error analysis shows that even the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors). Our findings indicate that LLMs are not superior to humans in calculator recommendation.