🤖 AI Summary
A standardized benchmark for evaluating the adherence of large language models (LLMs) to clinical guidelines and their capacity for evidence-based clinical reasoning is still lacking.
Method: We systematically integrated UK National Institute for Health and Care Excellence (NICE) clinical guidelines to construct the first silver-standard evaluation dataset grounded in real-world diagnostic scenarios, covering multiple disease domains. Structured patient cases and clinically relevant questions were generated using GPT-assisted prompting and rigorously validated by domain-expert clinicians.
Contribution/Results: Empirical evaluation of leading LLMs on this benchmark demonstrates its sensitivity in differentiating models' guideline-conformant reasoning capabilities. This work bridges a critical gap in AI evaluation for evidence-based medicine, delivering a reproducible, extensible, and standardized assessment framework. It establishes a foundational tool for regulatory compliance verification and iterative improvement of clinical LLMs.
📄 Abstract
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines covering multiple diagnoses. The dataset was generated with GPT assistance and contains realistic patient scenarios together with accompanying clinical questions. We benchmark a range of recent, widely used LLMs to demonstrate the dataset's validity. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.
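To illustrate how such a benchmark could be consumed, the following is a minimal Python sketch of a guideline-adherence evaluation loop. The record fields, the sample case, and the containment-based scoring are illustrative assumptions for this sketch only, not the paper's released data format or scoring protocol.

```python
from typing import Callable

# Illustrative record: a structured patient case paired with a guideline-grounded
# question and the guideline-recommended answer used as the reference.
# (Hypothetical example, not drawn from the released dataset.)
SAMPLE_CASES = [
    {
        "case": "62-year-old with new-onset exertional chest pain, no ECG changes.",
        "question": "What first-line diagnostic test does the guideline recommend?",
        "reference": "CT coronary angiography",
    },
]

def evaluate(model: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of questions answered in line with the reference answer."""
    correct = 0
    for record in cases:
        prompt = f"Patient case: {record['case']}\nQuestion: {record['question']}"
        answer = model(prompt)
        # Simple string containment as a stand-in for expert or rubric-based scoring.
        correct += record["reference"].lower() in answer.lower()
    return correct / len(cases)

if __name__ == "__main__":
    # Trivial stand-in model; replace with a call to the LLM under evaluation.
    dummy_model = lambda prompt: "I would order CT coronary angiography first."
    print(f"Guideline adherence: {evaluate(dummy_model, SAMPLE_CASES):.2%}")
```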