Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

📅 2026-06-03

🤖 AI Summary
This study addresses the tension between the rapid growth of medical literature and clinicians' limited time by building a retrieval-augmented generation (RAG)-based agentic framework around three large language models (Sonnet, GPT-4o, and Llama 3.1) to automatically generate literature summaries for ten clinical questions in headache care. The AI-generated summaries were compared against summaries written by ten headache specialists, who scored and ranked them while blinded to authorship. Expert-written summaries were generally preferred, but evaluators sometimes found it difficult to distinguish human- from AI-generated content. The study also identified expert-valued quality features beyond standard evaluation metrics, which can guide the refinement of both human and AI literature summarization pipelines.
📝 Abstract
Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.
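The evaluation design described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the function names `review_assignments` and `mean_scores` are hypothetical, and it assumes that expert i wrote the summary for question i and reviewed every other question, which the abstract does not state explicitly.

```python
from itertools import product
from statistics import mean

# Rubric criteria and summary sources named in the abstract.
CRITERIA = ("correctness", "completeness", "conciseness", "clinical_utility")
SOURCES = ("expert", "Sonnet", "GPT-4o", "Llama")


def review_assignments(n_questions=10):
    """Pair each expert with every question except their own.

    Assumption (not from the paper): expert i authored the summary for
    question i and reviews all remaining questions.
    """
    return [(e, q) for e, q in product(range(n_questions), repeat=2) if e != q]


def mean_scores(ratings):
    """Average 1-10 rubric scores per source and criterion.

    `ratings` is an iterable of (source, criterion, score) tuples.
    Returns {source: {criterion: mean score}}.
    """
    buckets = {}
    for source, criterion, score in ratings:
        if source not in SOURCES or criterion not in CRITERIA:
            raise ValueError(f"unknown source or criterion: {source}, {criterion}")
        if not 1 <= score <= 10:
            raise ValueError(f"score out of rubric range 1-10: {score}")
        buckets.setdefault(source, {}).setdefault(criterion, []).append(score)
    return {s: {c: mean(v) for c, v in by_c.items()} for s, by_c in buckets.items()}
```

Under these assumptions, each (expert, question) pair would cover four blinded summaries, each scored on the four criteria. Any scores passed to `mean_scores` in a demonstration would be placeholders, not study data.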
Problem

Research questions and friction points this paper is trying to address.

clinical literature summarization
artificial intelligence
large language models
expert evaluation
evidence-based medicine
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation
large language models
clinical summarization
expert evaluation
evidence-based medicine
Authors
Alejandro Lozano
Stanford University
Foundation Models, Multimodal Learning, Retrieval Augmentation
Keiko Ihara
Department of Neurology, Mayo Clinic, Rochester, MN, USA
Ping-Hao Yang
Department of Neurology, Dalhousie University, Halifax, Canada
Carrie E. Robertson
Jefferson Headache Center, Department of Neurology, Thomas Jefferson University, PA, USA
Jennifer Stern
NASA Goddard Space Flight Center
Mars, Nitrogen, Astrobiology, Stable Isotopes
Allan Purdy
Department of Neurology, University of Florida, Gainesville, FL, USA
Hsiangkuo Yuan
University of Colorado School of Medicine, Department of Pediatrics, Division of Child Neurology, Aurora, CO, USA
Pengfei Zhang
Department of Medicine, Mount Sinai Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Yulia Orlova
Department of Neurology, Mayo Clinic, Scottsdale, AZ, USA
Olga Fermo
Harvard Medical School, Boston, MA, USA
Jennifer Hranilovich
Department of Neurology, Mount Sinai Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Fred Cohen
Department of Neurology, Mayo Clinic, Rochester, MN, USA
Todd J. Schwedt
Department of Neurology, Mayo Clinic, Rochester, MN, USA
Jenelle A. Jindal
Stanford University, Palo Alto, CA, USA
Serena Yeung-Levy
Stanford University
Artificial Intelligence, Computer Vision
Chia-Chun Chiang
Department of Neurology, Mayo Clinic, Rochester, MN, USA