Large language models for automated PRISMA 2020 adherence checking

📅 2025-11-19

🤖 AI Summary
Problem: Peer review of systematic reviews (SRs) against the PRISMA 2020 guideline is labor-intensive and prone to inconsistency. Method: This study investigates automated LLM-based assessment of PRISMA 2020 adherence, supplying the checklist as structured input and constructing a copyright-aware, shareable benchmark dataset. Ten LLMs are evaluated across five input formats (Markdown, JSON, XML, plain text, and manuscript-only) on accuracy, sensitivity, and specificity. Contribution/Results: Structured checklist inputs yield 78.7–79.7% accuracy versus 45.21% for manuscript-only input (p < 0.0001), with no significant differences among the structured formats (p > 0.9). Qwen3-Max, chosen as a high-sensitivity open-weight model, achieves 95.1% sensitivity and 49.3% specificity on the full dataset. The results suggest that structured checklist provision makes LLM-based PRISMA checking substantially more accurate, giving a benchmark and methodological starting point for AI-assisted evidence synthesis, with human expert verification still required before editorial decisions.

📝 Abstract
Evaluating adherence to the PRISMA 2020 guideline remains a burden in the peer review process. To address the lack of shareable benchmarks, we constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews and evaluated ten large language models (LLMs) across five input formats. In a development cohort, supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yielded 78.7–79.7% accuracy versus 45.21% for manuscript-only input (p < 0.0001), with no significant differences between structured formats (p > 0.9). Across models, accuracy ranged from 70.6% to 82.8% with distinct sensitivity-specificity trade-offs, replicated in an independent validation cohort. We then selected Qwen3-Max (a high-sensitivity open-weight model) and extended evaluation to the full dataset (n=120), achieving 95.1% sensitivity and 49.3% specificity. Structured checklist provision substantially improves LLM-based PRISMA assessment, though human expert verification remains essential before editorial decisions.
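
The accuracy, sensitivity, and specificity figures above all derive from item-level agreement between a model's judgments and a reference standard. The following is a minimal sketch of that arithmetic, not the paper's evaluation code. It assumes that "item adequately reported" is the positive class (the abstract does not say which class is treated as positive), and the labels in `REFERENCE` and `PREDICTED` are made-up toy values, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of item-level agreement between reference and model labels."""
    tp: int = 0  # reference: reported,     model: reported
    fp: int = 0  # reference: not reported, model: reported
    tn: int = 0  # reference: not reported, model: not reported
    fn: int = 0  # reference: reported,     model: not reported

    @staticmethod
    def _ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a class is absent.
        return num / den if den else float("nan")

    @property
    def sensitivity(self) -> float:
        return self._ratio(self.tp, self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self._ratio(self.tn, self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return self._ratio(self.tp + self.tn, total)


def tally(reference: list[bool], predicted: list[bool]) -> Confusion:
    """Build a confusion table from paired boolean labels (True = reported)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    c = Confusion()
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            c.tp += 1
        elif ref and not pred:
            c.fn += 1
        elif not ref and pred:
            c.fp += 1
        else:
            c.tn += 1
    return c


# Hypothetical toy labels for eight checklist items (assumption, not real data).
REFERENCE = [True, True, True, False, False, True, False, True]
PREDICTED = [True, True, False, True, False, True, False, True]

if __name__ == "__main__":
    c = tally(REFERENCE, PREDICTED)
    print(f"sensitivity={c.sensitivity:.3f} "
          f"specificity={c.specificity:.3f} accuracy={c.accuracy:.3f}")
```

Under this convention, a high-sensitivity, low-specificity profile such as the one reported for Qwen3-Max means the model rarely misses items that the reference standard marks as reported but often marks unreported items as reported, which is one reason human verification stays in the loop.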

Problem

Research questions and friction points this paper is trying to address.

Automating PRISMA 2020 guideline adherence checking for systematic reviews
Evaluating large language models' accuracy in assessing PRISMA 2020 compliance
Addressing lack of shareable benchmarks for automated systematic review evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using structured PRISMA checklists as LLM input
Evaluating ten LLMs across five input formats
Selecting high-sensitivity open-weight model Qwen3-Max
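
To make the idea of supplying the checklist in different structured formats concrete, here is a small sketch that renders the same checklist entries as Markdown, JSON, XML, or plain text and prepends them to a manuscript. Everything in it is an assumption for illustration: the three `CHECKLIST` entries paraphrase PRISMA 2020 items, and the task wording and the `build_prompt` helper are hypothetical, not the prompts used in the paper.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated entries paraphrasing PRISMA 2020 items.
# The full checklist is longer; this is not the benchmark's actual input.
CHECKLIST = [
    {"id": "1", "section": "Title",
     "text": "Identify the report as a systematic review."},
    {"id": "3", "section": "Rationale",
     "text": "Describe the rationale for the review in the context of existing knowledge."},
    {"id": "4", "section": "Objectives",
     "text": "State the objective(s) or question(s) the review addresses."},
]


def to_markdown(items):
    return "\n".join(
        f"- **Item {it['id']} ({it['section']})**: {it['text']}" for it in items
    )


def to_json(items):
    return json.dumps(items, indent=2, ensure_ascii=False)


def to_xml(items):
    root = ET.Element("checklist")
    for it in items:
        node = ET.SubElement(root, "item", id=it["id"], section=it["section"])
        node.text = it["text"]
    return ET.tostring(root, encoding="unicode")


def to_plain(items):
    return "\n".join(f"{it['id']}. {it['section']}: {it['text']}" for it in items)


FORMATTERS = {
    "markdown": to_markdown,
    "json": to_json,
    "xml": to_xml,
    "plain": to_plain,
}


def build_prompt(manuscript, fmt=None):
    """Assemble a prompt; fmt=None gives the manuscript-only condition."""
    # Hypothetical task wording, chosen only for this sketch.
    parts = ["For each PRISMA 2020 item, answer yes or no: is it adequately reported?"]
    if fmt is not None:
        parts.append("Checklist:\n" + FORMATTERS[fmt](CHECKLIST))
    parts.append("Manuscript:\n" + manuscript)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("(manuscript text here)", fmt="markdown"))
```

Keeping the checklist content fixed and swapping only the formatter mirrors the comparison described above, where the presence of the checklist mattered far more than the markup used to express it.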
Yuki Kataoka
Center for Postgraduate Clinical Training and Career Development, Nagoya University Hospital, Nagoya, Aichi, Japan; Center for Medical Education, Graduate School of Medicine, Nagoya University, Nagoya, Aichi, Japan; Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Internal Medicine, Kyoto Min-iren Asukai Hospital, Kyoto, Japan; Department of Healthcare Epidemiology, Kyoto University Graduate School of Medicine / School of Public Health, Kyoto, Japan; Department of Interna
Ryuhei So
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Psychiatry, Okayama Psychiatric Medical Center, Okayama, Japan; CureApp, Inc., Tokyo, Japan
Masahiro Banno
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Psychiatry, Seichiryo Hospital, Nagoya, Japan
Yasushi Tsujimoto
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Oku medical clinic, Osaka, Japan; Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine / School of Public Health, Kyoto, Japan
Tomohiro Takayama
Kyoto University Hospital, Kyoto, Japan
Yosuke Yamagishi
Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
Takahiro Tsuge
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Rehabilitation, Kurashiki Medical Centre, Kurashiki, Okayama, Japan; Department of Epidemiology, Graduate School of Medicine, Dentistry, and Pharmaceutical Sciences, Okayama University, Okayama, Japan
Norio Yamamoto
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Orthopedic Surgery, Minato Medical Coop-Kyoritsu General Hospital, Nagoya, Aichi, Japan
Chiaki Suda
Scientific Research Works Peer Support Group (SRWS-PSG), Osaka, Japan; Department of Public Health, Gunma University Graduate School of Medicine, Gunma, Japan
Toshi A. Furukawa
Professor of Clinical Epidemiology, Kyoto University
Clinical epidemiology, Cognitive-behavioral therapy, Systematic reviews