🤖 AI Summary
Despite growing interest in large language models (LLMs) for clinical applications, systematic evaluation of state-of-the-art open-weight models—particularly Meta-Llama 3.1—in high-stakes, domain-specific intensive care medicine (ICM) remains lacking.
Method: We conducted the first comprehensive assessment of Llama 3.1 (8B and 70B) on 871 expert-curated, multi-subspecialty clinical questions spanning respiratory, cardiovascular, renal, infectious, neurological, and research-oriented domains in ICM, using a standardized, clinically validated question bank.
Contribution/Results: The 70B model achieved a mean accuracy of 60%, substantially outperforming the 8B variant (+30 percentage points), with peak performance on research-oriented questions (68.4%) and lowest on renal topics (47.9%). Our analysis reveals a nonlinear relationship between model scale and subdomain complexity, demonstrating persistent capability gaps in highly specialized clinical reasoning. These findings underscore the necessity of subspecialty-targeted fine-tuning and evaluation frameworks for deploying LLMs safely in critical care settings.
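The per-domain accuracy figures above imply a simple aggregation step: grade each of the 871 questions, group by subspecialty, and average. A minimal sketch of that aggregation is below; the function name and the illustrative data are hypothetical, not taken from the study's actual grading pipeline.

```python
# Hypothetical sketch of per-domain accuracy aggregation, assuming each
# graded question is recorded as a (domain, is_correct) pair. Domain names
# mirror the reported subspecialties; the sample data is illustrative only.
from collections import defaultdict

def accuracy_by_domain(results):
    """results: iterable of (domain, is_correct) tuples -> {domain: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [n_correct, n_total]
    for domain, correct in results:
        totals[domain][0] += int(correct)
        totals[domain][1] += 1
    return {d: c / n for d, (c, n) in totals.items()}

# Toy example (not the study's data):
sample = [
    ("Research", True), ("Research", True), ("Research", False),
    ("Renal", True), ("Renal", False),
]
acc = accuracy_by_domain(sample)
```

Overall accuracy is then the question-weighted mean across domains, which is why domains with few questions can swing per-domain figures more than the overall number.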
📝 Abstract
Large language models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed the 8B model by 30 percentage points, achieving 60% average accuracy. Performance varied across domains, with the highest accuracy in Research (68.4%) and the lowest in Renal (47.9%), highlighting the need for future work to improve models across subspecialty domains.