Performance of Large Language Models in Answering Critical Care Medicine Questions

📅 2025-09-16
🤖 AI Summary
Despite growing interest in large language models (LLMs) for clinical applications, systematic evaluation of state-of-the-art open-weight models—particularly Meta-Llama 3.1—in high-stakes, domain-specific intensive care medicine (ICM) remains lacking. Method: We conducted the first comprehensive assessment of Llama 3.1 (8B and 70B) on 871 expert-curated, multi-subspecialty clinical questions spanning respiratory, cardiovascular, renal, infectious, neurological, and research-oriented domains in ICM, using a standardized, clinically validated question bank. Contribution/Results: The 70B model achieved a mean accuracy of 60%, substantially outperforming the 8B variant (+30 percentage points), with peak performance on research-oriented questions (68.4%) and lowest on renal topics (47.9%). Our analysis reveals a nonlinear relationship between model scale and subdomain complexity, demonstrating persistent capability gaps in highly specialized clinical reasoning. These findings underscore the necessity of subspecialty-targeted fine-tuning and evaluation frameworks for deploying LLMs safely in critical care settings.
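The evaluation above boils down to scoring each question as correct or incorrect, grouping by subspecialty, and taking the mean per group. A minimal sketch of that per-domain accuracy computation is below; the domain names and result values are illustrative placeholders, not the paper's raw data, and `accuracy_by_domain` is a hypothetical helper, not code from the study.

```python
from collections import defaultdict

# Hypothetical per-question results: (subspecialty domain, answered correctly?).
# These records are made up for illustration; the paper used 871 questions.
results = [
    ("Research", True), ("Research", True), ("Research", False),
    ("Renal", False), ("Renal", True),
    ("Respiratory", True), ("Respiratory", False),
]

def accuracy_by_domain(records):
    """Return the fraction of correct answers for each domain."""
    tally = defaultdict(lambda: [0, 0])  # domain -> [num_correct, num_total]
    for domain, correct in records:
        tally[domain][0] += int(correct)
        tally[domain][1] += 1
    return {domain: correct / total for domain, (correct, total) in tally.items()}

print(accuracy_by_domain(results))
```

Reporting a mean per subdomain rather than a single overall accuracy is what surfaces the gap the summary highlights (e.g. research-oriented vs. renal questions).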

📝 Abstract
Large Language Models have been tested on medical student-level questions, but their performance in specialized fields such as Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama 3.1 70B outperformed the 8B model by 30 percentage points, reaching 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for future work to improve models across subspecialty domains.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Large Language Models' performance on Critical Care Medicine questions
Assessing specialized domain accuracy across different medical subspecialties
Identifying performance gaps between model sizes and clinical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated Meta-Llama 3.1 models on critical care questions
Compared performance of 70B and 8B parameter model versions
Analyzed accuracy variations across different medical subspecialties
Mahmoud Alwakeel
Duke University School of Medicine, Durham, North Carolina, United States
Aditya Nagori
Duke University
Computational Biomedicine · GenAI for Medicine · Intensive Care Unit · Data Science
An-Kwok Ian Wong
Duke University School of Medicine, Durham, North Carolina, United States
Neal Chaisson
Cleveland Clinic Foundation, Cleveland, Ohio, United States
Vijay Krishnamoorthy
Duke University School of Medicine, Durham, North Carolina, United States
Rishikesan Kamaleswaran
Duke University
Host-Response · Injury · Critical Care · Machine Learning · Artificial Intelligence