Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatically assigning standardized RADS categories from narrative radiology reports is difficult because of complex guidelines, constrained output formats, and a lack of systematic evaluation. This work introduces RXL-RADSet, the first synthetic multi-RADS radiology report benchmark, covering ten RADS standards with 1,600 radiologist-validated reports. The authors conduct a head-to-head evaluation of 41 open-weight small language models (0.135–32B parameters) alongside the proprietary GPT-5.2 under a unified prompting strategy. With guided prompting, GPT-5.2 achieves 99.8% validity and 81.1% accuracy, while open-weight models in the 20–32B range reach roughly 99% validity and mid-to-high-70% accuracy, showing clear gains with scale. Guided prompting consistently outperforms the zero-shot setting across all evaluated models.

📝 Abstract
Background: Reporting and Data Systems (RADS) standardize radiology risk communication, but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes.

Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and to compare the validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment.

Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135–32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting.

Results: Under guided prompting, GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20–32B range reached ~99% validity and mid-to-high-70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity, primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot prompting.

Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20–32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
Problem

Research questions and friction points this paper is trying to address.

RADS
radiology report
automated classification
benchmarking
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic radiology reports
multi-RADS benchmark
open-weight language models
guided prompting
radiologist-verified dataset
👥 Authors
Kartik Bose - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Abhinandan Kumar
Raghuraman Soundararajan - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Priya Mudgil - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Samonee Ralmilay - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Niharika Dutta - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
M. Singhal - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Arun Kumar - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Saugata Sen - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Anurima Patra - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Priya Ghosh - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Abanti Das - Department of Radiodiagnosis, All India Institute of Medical Sciences, Kalyani, India 741245
Amit Gupta
Ashish Verma - Department of Radiodiagnosis, Banaras Hindu University, Varanasi, India 221005
Dipin Sudhakaran - Department of Radiodiagnosis, Aster Malabar Institute of Medical Sciences, Kerala, India 670621
E. Dhamija - Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India 110029
Himangi Unde - Department of Radiodiagnosis, Tata Main Hospital, Mumbai, India 400012
Ishan Kumar - Department of Radiodiagnosis, Banaras Hindu University, Varanasi, India 221005
K. Rangarajan - Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India 110029
Prerna Garg - Department of Radiodiagnosis, Rajiv Gandhi Cancer Institute and Research Centre, Delhi, India 110085
Rachel Sequeira - Department of Radiodiagnosis, Tata Main Hospital, Mumbai, India 400012
Sudhin Shylendran - Department of Radiodiagnosis, Baby Memorial Hospital, Kerala, India 670621
Taruna Yadav - Department of Radiodiagnosis, All India Institute of Medical Sciences, Jodhpur, India 342005
Tej Pal - Department of Radiodiagnosis, National Cancer Institute, Jhajjar, India 124105
Pankaj Gupta - Postgraduate Institute of Medical Education and Research, Chandigarh