🤖 AI Summary
Peer review of systematic reviews (SRs) against the PRISMA 2020 guideline is labor-intensive and prone to inconsistency. Method: This study investigates automated LLM-based assessment of PRISMA 2020 compliance, proposing a structured input paradigm and constructing a copyright-aware, open benchmark dataset. Ten LLMs are evaluated across five input formats (Markdown, JSON, XML, and plain-text checklists, plus manuscript-only full text), with performance reported as accuracy, sensitivity, and specificity. Contribution/Results: Supplying a structured checklist yields 78.7%–79.7% accuracy, versus 45.21% for manuscript-only input (p < 0.0001); no significant differences are observed among the structured formats (p > 0.9). Qwen3-Max, a high-sensitivity open-weight model, was evaluated on the full dataset and reached 95.1% sensitivity and 49.3% specificity. The results indicate that structured checklist provision substantially improves LLM-based PRISMA compliance checking and provide a shareable benchmark for AI-assisted evidence synthesis, although human expert verification remains essential before editorial decisions.
📝 Abstract
Evaluating adherence to the PRISMA 2020 guideline remains a burden in the peer review process. To address the lack of shareable benchmarks, we constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews and evaluated ten large language models (LLMs) across five input formats. In a development cohort, supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yielded 78.7–79.7% accuracy versus 45.21% for manuscript-only input (p < 0.0001), with no significant differences between structured formats (p > 0.9). Across models, accuracy ranged from 70.6% to 82.8%, with distinct sensitivity-specificity trade-offs that were replicated in an independent validation cohort. We then selected Qwen3-Max (a high-sensitivity open-weight model) and extended evaluation to the full dataset (n=120), achieving 95.1% sensitivity and 49.3% specificity. Structured checklist provision substantially improves LLM-based PRISMA assessment, though human expert verification remains essential before editorial decisions.
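For readers less familiar with the metrics reported above, the following is a minimal sketch of how accuracy, sensitivity, and specificity can be computed from item-level judgments. It is not the authors' evaluation code. The function names (`confusion_counts`, `ratio`, `summarize`) are hypothetical, the toy labels are invented for illustration, and the convention that `True` marks the "positive" class is an assumption, since the abstract does not state which class the study treats as positive.

```python
from typing import Optional, Sequence


def confusion_counts(reference: Sequence[bool], predicted: Sequence[bool]) -> dict:
    """Tally agreement between expert labels and model labels.

    Assumption: True is the positive class in both sequences.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            counts["tp"] += 1  # positive, and the model agrees
        elif ref and not pred:
            counts["fn"] += 1  # positive, but the model missed it
        elif pred:
            counts["fp"] += 1  # negative, but the model said positive
        else:
            counts["tn"] += 1  # negative, and the model agrees
    return counts


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize(reference: Sequence[bool], predicted: Sequence[bool]) -> dict:
    """Compute sensitivity, specificity, and accuracy from paired labels."""
    c = confusion_counts(reference, predicted)
    total = sum(c.values())
    return {
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        "accuracy": ratio(c["tp"] + c["tn"], total),
    }


# Toy labels for five checklist items (invented, not study data).
expert_labels = [True, True, True, False, False]
model_labels = [True, True, False, True, False]
metrics = summarize(expert_labels, model_labels)
```

Whatever the positive class is, a model with high sensitivity and low specificity labels most true positives correctly but mislabels many true negatives as positive, which is consistent with the abstract's point that expert verification is still needed.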