🤖 AI Summary
A standardized benchmark for evaluating the adherence of large language models (LLMs) to clinical guidelines and their capacity for evidence-based clinical reasoning is still lacking.
Method: We systematically integrated UK National Institute for Health and Care Excellence (NICE) clinical guidelines to construct the first silver-standard evaluation dataset grounded in real-world diagnostic scenarios, covering multiple disease domains. Structured patient cases and clinically relevant questions were generated using GPT-assisted prompting and rigorously validated by domain-expert clinicians.
Contribution/Results: Empirical evaluation of leading LLMs on this benchmark demonstrates its sensitivity in differentiating models' guideline-conformant reasoning capabilities. This work bridges a critical gap in AI evaluation for evidence-based medicine, delivering a reproducible, extensible, and standardized assessment framework. It establishes a foundational tool for regulatory compliance verification and iterative improvement of clinical LLMs.
📄 Abstract
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines covering multiple diagnoses. The dataset was generated with GPT assistance and contains realistic patient scenarios together with accompanying clinical questions. We benchmark a range of recent, widely used LLMs to demonstrate the dataset's validity. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.
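To illustrate how such a benchmark could be consumed, the following is a minimal Python sketch of a guideline-adherence evaluation loop. The record fields, the sample case, and the containment-based scoring are illustrative assumptions for this sketch only, not the paper's released data format or scoring protocol.

```python
from typing import Callable

# Illustrative record: a structured patient case paired with a guideline-grounded
# question and the guideline-recommended answer used as the reference.
# (Hypothetical example, not drawn from the released dataset.)
SAMPLE_CASES = [
    {
        "case": "62-year-old with new-onset exertional chest pain, no ECG changes.",
        "question": "What first-line diagnostic test does the guideline recommend?",
        "reference": "CT coronary angiography",
    },
]

def evaluate(model: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of questions answered in line with the reference answer."""
    correct = 0
    for record in cases:
        prompt = f"Patient case: {record['case']}\nQuestion: {record['question']}"
        answer = model(prompt)
        # Simple string containment as a stand-in for expert or rubric-based scoring.
        correct += record["reference"].lower() in answer.lower()
    return correct / len(cases)

if __name__ == "__main__":
    # Trivial stand-in model; replace with a call to the LLM under evaluation.
    dummy_model = lambda prompt: "I would order CT coronary angiography first."
    print(f"Guideline adherence: {evaluate(dummy_model, SAMPLE_CASES):.2%}")
```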