Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators

📅 2024-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Large language models (LLMs) lack systematic evaluation in real-world clinical decision support, specifically in recommending the appropriate medical calculator (e.g., a risk stratification or diagnostic tool) for a given clinical scenario. Method: The authors built a clinician-annotated multiple-choice benchmark of 1,009 question-answer pairs covering 35 clinical calculators, evaluated nine LLMs (open-source, proprietary, and domain-specific), and compared them with human annotators on a 100-question subset, with fine-grained error attribution and confidence-interval estimation. Results: The best-performing model, OpenAI o1, achieved only 66.0% accuracy, below the human average of 79.5%. Nearly half of all errors (49.3%) stemmed from misreading the clinical scenario (comprehension), and a further 7.1% from gaps in calculator knowledge. The work identifies comprehension as a key bottleneck in LLMs' clinical utility and establishes an empirical benchmark and methodology for assessing LLM clinical adaptability.

📝 Abstract
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine LLMs, including open-source, proprietary, and domain-specific models, with 1,009 multiple-choice question-answer pairs across 35 clinical calculators and compared LLMs to humans on a subset of questions. While the highest-performing LLM, OpenAI o1, provided an answer accuracy of 66.0% (CI: 56.7-75.3%) on the subset of 100 questions, two human annotators nominally outperformed LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). Ultimately, we evaluated medical trainees and LLMs in recommending medical calculators across clinical scenarios like risk stratification and diagnosis. With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors), our findings highlight that LLMs are not superior to humans in calculator recommendation.
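The reported interval for OpenAI o1 (66.0%, CI 56.7-75.3% on the 100-question subset) is consistent with a simple normal-approximation (Wald) 95% confidence interval for a proportion. A minimal sketch, assuming 66 correct answers out of 100 and a Wald interval (the paper may use a different interval method):

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

# Assumed counts: 66 correct out of the 100-question subset
p, lo, hi = wald_ci(66, 100)
print(f"accuracy {p:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
# accuracy 66.0%, 95% CI 56.7%-75.3%
```

The same check on the human average (79.5%) does not pin down the method exactly, so only the model interval is reproduced here.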
Problem

Research questions and friction points this paper is trying to address.

Whether LLMs can support real clinical decision-making beyond licensing-exam knowledge
How LLMs compare with humans at selecting the appropriate medical calculator
What kinds of errors (comprehension vs. calculator knowledge) LLMs make
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark of 1,009 question-answer pairs across 35 clinical calculators
Head-to-head comparison of nine LLMs (open-source, proprietary, and domain-specific) with human annotators
Fine-grained error analysis attributing mistakes to comprehension (49.3%) and calculator knowledge (7.1%)
👥 Authors
Nicholas Wan
Division of Intramural Research, National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
Qiao Jin
Division of Intramural Research, National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
Joey Chan
Division of Intramural Research, National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
Guangzhi Xiong
University of Virginia
Serina Applebaum
Department of Ophthalmology & Visual Science, Yale School of Medicine, New Haven, CT, USA.
Aidan Gilson
Massachusetts Eye and Ear, Harvard Medical School
Reid McMurry
Department of Emergency Medicine, Boston Medical Center, Boston, MA, USA.
R. Andrew Taylor
Department of Biomedical Informatics & Data Science, Yale School of Medicine, New Haven, CT, USA. Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, USA.
Aidong Zhang
Department of Computer Science, University of Virginia, Charlottesville, VA, USA.
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC