Clean&Clear: Feasibility of Safe LLM Clinical Guidance

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges of slow and inaccurate clinical guideline information retrieval by developing a trustworthy question-answering system based on Llama-3.1-8B. Methodologically, we introduce a “citation-first” paradigm: responses are strictly constrained to verbatim excerpts from official University College London Hospitals (UCLH) clinical guidelines, supported by precise document retrieval and fine-grained passage extraction to eliminate hallucination; evaluation employs a gold-standard framework with multi-physician collaborative annotation. Our key contributions include: (1) the first systematic application of citation-constrained reasoning to clinical guideline QA, achieving 98% recall of guideline statements; and (2) empirical validation showing 73% of answers are highly relevant, 78% are content-complete, and 72% contain no clinically erroneous reasoning—while maintaining an average response latency of 10 seconds (three times faster than manual lookup), thereby significantly enhancing the safety, traceability, and efficiency of clinical decision-making.
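The "citation-first" paradigm described above can be illustrated with a minimal sketch: the system may only return verbatim guideline lines, each paired with its source document, and never free-form generated text. The guideline snippets, keyword-overlap scorer, and threshold below are illustrative placeholders, not the paper's actual UCLH data or retrieval model.

```python
import re

# Toy guideline store: document name -> verbatim guideline lines.
# Illustrative content only, not real UCLH guidance.
GUIDELINES = {
    "sepsis": [
        "Administer broad-spectrum antibiotics within one hour.",
        "Take blood cultures before starting antibiotics.",
    ],
    "asthma": [
        "Give high-dose inhaled bronchodilator via spacer.",
    ],
}

def tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def score(question, line):
    """Toy relevance: fraction of question words present in the line."""
    q = tokens(question)
    return len(q & tokens(line)) / max(len(q), 1)

def answer(question, threshold=0.1):
    """Citation-first answering: return only verbatim excerpts,
    each carrying its source document as the citation."""
    hits = []
    for doc, lines in GUIDELINES.items():
        for line in lines:
            if score(question, line) >= threshold:
                hits.append((doc, line))  # citation travels with the excerpt
    return hits

for doc, line in answer("When should antibiotics be given for sepsis?"):
    print(f"[{doc}] {line}")
```

Because the answer is assembled purely from retrieved verbatim lines, there is nothing for the model to hallucinate; the trade-off is that fluency and synthesis are sacrificed for traceability.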

📝 Abstract
Background: Clinical guidelines are central to safe, evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines.

Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach prioritises the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard.

Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, the chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve access to locally relevant clinical information for healthcare professionals.
Problem

Research questions and friction points this paper is trying to address.

Develops LLM chatbot for clinical guideline queries
Ensures safety by prioritizing information referencing
Assesses chatbot accuracy against doctor evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Llama-3.1-8B LLM for guideline extraction
Prioritizes safety via referenced information over generation
Achieves high recall (0.98) for critical information
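The headline recall of 0.98 is a line-level metric: the fraction of gold-standard guideline lines that the system's extraction recovered. A minimal sketch of that computation, with illustrative line sets rather than the paper's evaluation data:

```python
def line_recall(extracted, gold):
    """Fraction of gold-standard guideline lines present in the
    extracted set. High recall means little critical guidance is missed;
    it says nothing about extra, unnecessary lines (that is precision)."""
    gold = set(gold)
    if not gold:
        return 1.0
    return len(set(extracted) & gold) / len(gold)

# Illustrative example: 3 of the 4 gold lines were extracted.
gold = {"line A", "line B", "line C", "line D"}
extracted = {"line A", "line B", "line C", "line E"}
print(line_recall(extracted, gold))  # -> 0.75
```

Optimising for recall matches the safety framing: missing a critical guideline line is costlier than surfacing a redundant one, which is why the paper reports a separate, lower figure (~14.5% of responses) for unnecessary content.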
Julia Ive
University College London
Felix Jozsa
Wolfson Institute of Biomedical Research, University College London, UK; National Hospital for Neurology and Neurosurgery, Queen Square, London, UK
Nick Jackson
King’s College Hospital, Denmark Hill, London, UK
Paulina Bondaronek
University College London
digital health, evaluation, natural language processing
Ciaran Scott Hill
Wolfson Institute of Biomedical Research, University College London, UK
Richard Dobson
University College London, London, UK; King’s College London, London, UK