🤖 AI Summary
This study addresses the tension between the rapid growth of medical literature and clinicians’ limited time by developing a retrieval-augmented generation (RAG)-based agent framework that uses three state-of-the-art large language models (Sonnet, GPT-4o, and Llama 3.1) to automatically generate narrative reviews on ten key clinical questions in migraine care. The quality of these AI-generated reviews was systematically compared against reviews written by ten domain experts, with the experts evaluating all reviews blinded to authorship. Expert-written reviews were preferred, although evaluators sometimes found it difficult to distinguish between human- and AI-generated content. The study further identified expert-valued quality features that extend beyond standard evaluation metrics, offering directions for refining both human and AI literature summarization in evidence-based decision-making.
📝 Abstract
Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.
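The evaluation design described in the abstract can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' code: the function names, the A-D blinding labels, the random shuffling, and the assumption that every expert rated all nine questions other than their own are ours. Only the four sources, the four criteria, the 1 to 10 scale, and the exclusion of each expert's own topic come from the abstract.

```python
import random

# Taken from the abstract: four summary sources and four scoring criteria.
SOURCES = ("expert", "Sonnet", "GPT-4o", "Llama")
CRITERIA = ("correctness", "completeness", "conciseness", "clinical_utility")


def build_assignments(n_questions=10, seed=0):
    """Return, per rater, the blinded summaries to evaluate.

    Assumption: rater i authored the expert summary for question i and
    therefore skips that question. Letters A-D hide the true source.
    """
    rng = random.Random(seed)
    assignments = {}
    for rater in range(n_questions):
        tasks = []
        for question in range(n_questions):
            if question == rater:
                continue  # experts do not evaluate their own topic
            order = list(SOURCES)
            rng.shuffle(order)  # hypothetical blinding: random label order
            tasks.append({"question": question,
                          "blinded": dict(zip("ABCD", order))})
        assignments[rater] = tasks
    return assignments


def validate_scores(scores):
    """Check that one summary has an integer 1-10 score for every criterion."""
    return (set(scores) == set(CRITERIA)
            and all(isinstance(v, int) and 1 <= v <= 10
                    for v in scores.values()))


assignments = build_assignments()
```

Each rater's task list then pairs with a score sheet per blinded label, plus a preference ranking and an expert-or-LLM guess, which are omitted here for brevity.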