🤖 AI Summary
Large language models (LLMs) lack systematic evaluation in real-world clinical decision support, specifically in their ability to recommend appropriate medical calculators (e.g., risk stratification or diagnostic tools) in authentic clinical scenarios. Method: The authors constructed a multiple-choice benchmark covering diverse clinical decision tasks, annotated by clinicians, and compared LLMs against medical trainees. Error analysis included fine-grained attribution of mistakes to clinical comprehension failures and statistical estimation of confidence intervals. Contribution/Results: The best-performing model (OpenAI o1) achieved only 66.0% accuracy, below the human average of 79.5%. Nearly half of all errors (49.3%) stemmed from deficits in clinical comprehension. This work identifies a critical bottleneck in LLMs’ clinical utility and establishes both a methodological foundation and an empirical benchmark for assessing how well LLMs adapt to clinical decision-support tasks.
📝 Abstract
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting the appropriate medical calculator, remains uncertain. We evaluated nine LLMs, including open-source, proprietary, and domain-specific models, on 1,009 multiple-choice question-answer pairs spanning 35 clinical calculators and scenarios such as risk stratification and diagnosis, and compared LLMs to medical trainees on a 100-question subset. On that subset, the highest-performing LLM, OpenAI o1, achieved an answer accuracy of 66.0% (CI: 56.7-75.3%), while two human annotators nominally outperformed all LLMs with an average accuracy of 79.5% (CI: 73.5-85.0%). Error analysis shows that even the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors). Our findings indicate that LLMs are not superior to humans in calculator recommendation.