Large language models for automated PRISMA 2020 adherence checking

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Background: Peer review of systematic reviews (SRs) against the PRISMA 2020 guideline is labor-intensive and prone to inconsistency. Method: This study investigates LLM-based assessment of PRISMA 2020 adherence, supplying the checklist as structured input and constructing a copyright-aware, shareable benchmark dataset. Ten LLMs are evaluated across five input formats: four structured checklist formats (Markdown, JSON, XML, and plain text) and manuscript-only input, with performance reported as accuracy, sensitivity, and specificity. Contribution/Results: Supplying a structured checklist yields 78.7–79.7% accuracy, compared with 45.21% for manuscript-only input, and no significant differences appear among the structured formats. Qwen3-Max, selected as a high-sensitivity open-weight model, achieves 95.1% sensitivity and 49.3% specificity on the full dataset. The results indicate that structured checklist provision substantially improves LLM-based PRISMA assessment and that the benchmark can support further work on AI-assisted evidence synthesis, although human expert verification remains essential before editorial decisions.

📝 Abstract

Evaluating adherence to the PRISMA 2020 guideline remains a burden in the peer review process. To address the lack of shareable benchmarks, we constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews and evaluated ten large language models (LLMs) across five input formats. In a development cohort, supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yielded 78.7-79.7% accuracy versus 45.21% for manuscript-only input (p < 0.0001), with no differences between structured formats (p > 0.9). Across models, accuracy ranged from 70.6% to 82.8% with distinct sensitivity-specificity trade-offs, which were replicated in an independent validation cohort. We then selected Qwen3-Max (a high-sensitivity open-weight model) and extended the evaluation to the full dataset (n=120), achieving 95.1% sensitivity and 49.3% specificity. Structured checklist provision substantially improves LLM-based PRISMA assessment, though human expert verification remains essential before editorial decisions.
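The abstract reports accuracy, sensitivity, and specificity for item-level judgments. As a minimal sketch of how these three metrics relate, the Python below computes them from paired gold and predicted labels. The label convention (`True` meaning "checklist item adequately reported") and the helper names `confusion`, `sensitivity`, `specificity`, and `accuracy` are assumptions made for illustration; the paper's own definition of the positive class is not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # gold True, predicted True
    fp: int  # gold False, predicted True
    tn: int  # gold False, predicted False
    fn: int  # gold True, predicted False


def confusion(gold, pred):
    """Count the four outcome types for paired boolean labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    return Confusion(tp, fp, tn, fn)


def _ratio(num, den):
    # Return NaN instead of raising when a class is absent from the gold labels.
    return num / den if den else float("nan")


def sensitivity(c):
    """Share of gold-positive items that the model also marked positive."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """Share of gold-negative items that the model also marked negative."""
    return _ratio(c.tn, c.tn + c.fp)


def accuracy(c):
    """Share of all items where the model agrees with the gold label."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


# Toy labels (hypothetical, not from the benchmark): assumed convention is
# True = "item adequately reported".
gold = [True, True, True, False, False]
pred = [True, True, True, True, False]
c = confusion(gold, pred)
print(sensitivity(c), specificity(c), accuracy(c))
```

In this toy case the model catches every positive item but misclassifies one of the two negatives, the same kind of high-sensitivity, lower-specificity trade-off that the abstract describes for Qwen3-Max.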

Problem

Research questions and friction points this paper is trying to address.

Automating PRISMA 2020 guideline adherence checking for systematic reviews

Evaluating large language models' accuracy in assessing PRISMA 2020 compliance

Addressing the lack of shareable benchmarks for automated systematic review evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using structured PRISMA checklists as LLM input

Evaluating ten LLMs across five input formats

Selecting the high-sensitivity open-weight model Qwen3-Max for evaluation on the full dataset
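To make the idea of structured checklist input concrete, here is a minimal Python sketch that renders the same checklist as Markdown, JSON, XML, or plain text and combines it with a manuscript into a prompt. Everything in it is an assumption for illustration: the two abbreviated checklist entries, the function names, and the prompt wording are hypothetical and are not taken from the paper's actual prompts or data.

```python
import json
import xml.etree.ElementTree as ET

# Two abbreviated, illustrative checklist entries (not the full PRISMA 2020 checklist).
CHECKLIST = [
    {"id": "1", "section": "Title",
     "text": "Identify the report as a systematic review."},
    {"id": "3", "section": "Rationale",
     "text": "Describe the rationale for the review in the context of existing knowledge."},
]


def to_markdown(items):
    return "\n".join(
        f"- **Item {i['id']} ({i['section']})**: {i['text']}" for i in items
    )


def to_json(items):
    return json.dumps({"checklist": items}, indent=2)


def to_xml(items):
    root = ET.Element("checklist")
    for i in items:
        el = ET.SubElement(root, "item", id=i["id"], section=i["section"])
        el.text = i["text"]
    return ET.tostring(root, encoding="unicode")


def to_plain(items):
    return "\n".join(f"Item {i['id']} ({i['section']}): {i['text']}" for i in items)


FORMATTERS = {
    "markdown": to_markdown,
    "json": to_json,
    "xml": to_xml,
    "plain": to_plain,
}


def build_prompt(manuscript, items=None, fmt=None):
    """Assemble a prompt; fmt=None mimics manuscript-only input (no checklist)."""
    # Hypothetical instruction wording, chosen only for this sketch.
    parts = ["Assess whether the manuscript reports each PRISMA 2020 item."]
    if fmt is not None:
        parts.append("Checklist:\n" + FORMATTERS[fmt](items))
    parts.append("Manuscript:\n" + manuscript)
    return "\n\n".join(parts)


print(build_prompt("(manuscript text here)", CHECKLIST, "markdown"))
```

Keeping the formatters behind one dictionary makes it straightforward to run the same manuscript through each format and compare results, which mirrors the format comparison described above.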
