🤖 AI Summary
Despite growing interest in large language models (LLMs) for clinical applications, systematic evaluation of state-of-the-art open-weight models—particularly Meta-Llama 3.1—in high-stakes, domain-specific intensive care medicine (ICM) remains lacking.
Method: We conducted the first comprehensive assessment of Llama 3.1 (8B and 70B) on 871 expert-curated, multi-subspecialty clinical questions spanning respiratory, cardiovascular, renal, infectious, neurological, and research-oriented domains in ICM, using a standardized, clinically validated question bank.
Contribution/Results: The 70B model achieved a mean accuracy of 60%, substantially outperforming the 8B variant (+30 percentage points), with peak performance on research-oriented questions (68.4%) and lowest on renal topics (47.9%). Our analysis reveals a nonlinear relationship between model scale and subdomain complexity, demonstrating persistent capability gaps in highly specialized clinical reasoning. These findings underscore the necessity of subspecialty-targeted fine-tuning and evaluation frameworks for deploying LLMs safely in critical care settings.
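The per-domain accuracy figures above imply a simple aggregation step: grade each of the 871 questions, group by subspecialty, and average. A minimal sketch of that aggregation is below; the function name and the illustrative data are hypothetical, not taken from the study's actual grading pipeline.

```python
# Hypothetical sketch of per-domain accuracy aggregation, assuming each
# graded question is recorded as a (domain, is_correct) pair. Domain names
# mirror the reported subspecialties; the sample data is illustrative only.
from collections import defaultdict

def accuracy_by_domain(results):
    """results: iterable of (domain, is_correct) tuples -> {domain: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [n_correct, n_total]
    for domain, correct in results:
        totals[domain][0] += int(correct)
        totals[domain][1] += 1
    return {d: c / n for d, (c, n) in totals.items()}

# Toy example (not the study's data):
sample = [
    ("Research", True), ("Research", True), ("Research", False),
    ("Renal", True), ("Renal", False),
]
acc = accuracy_by_domain(sample)
```

Overall accuracy is then the question-weighted mean across domains, which is why domains with few questions can swing per-domain figures more than the overall number.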
📝 Abstract
Large language models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed the 8B model by 30 percentage points, achieving 60% average accuracy. Performance varied across domains, with the highest accuracy in Research (68.4%) and the lowest in Renal (47.9%), highlighting the need for future work to improve models across subspecialty domains.