Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatically assigning standardized RADS categories from narrative radiology reports is difficult because of complex guidelines, constrained output formats, and a lack of systematic evaluation. This work introduces RXL-RADSet, the first synthetic multi-RADS radiology report benchmark, covering ten RADS standards with 1,600 radiologist-validated reports. The authors conduct a head-to-head evaluation of 41 open-weight small language models (0.135–32B parameters) alongside the proprietary GPT-5.2 under a unified prompting strategy. With guided prompting, GPT-5.2 achieves 99.8% validity and 81.1% accuracy, while open-weight models in the 20–32B range reach roughly 99% validity and mid-to-high-70% accuracy, showing clear gains with scale. Guided prompting consistently outperforms the zero-shot setting across all evaluated models.

📝 Abstract
Background: Reporting and Data Systems (RADS) standardize radiology risk communication, but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes.

Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and to compare the validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment.

Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135–32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting.

Results: Under guided prompting, GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20–32B range reached ~99% validity and mid-to-high-70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity, primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot prompting.

Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20–32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
Problem

Research questions and friction points this paper is trying to address.

RADS
radiology report
automated classification
benchmarking
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic radiology reports
multi-RADS benchmark
open-weight language models
guided prompting
radiologist-verified dataset
👥 Authors
Kartik Bose - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Abhinandan Kumar
Raghuraman Soundararajan - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Priya Mudgil - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Samonee Ralmilay - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Niharika Dutta - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
M. Singhal - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Arun Kumar - Department of Radiodiagnosis, Postgraduate Institute of Medical Education and Research, Chandigarh, India 160012
Saugata Sen - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Anurima Patra - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Priya Ghosh - Department of Radiodiagnosis, Tata Medical Center, Kolkata, India 700156
Abanti Das - Department of Radiodiagnosis, All India Institute of Medical Sciences, Kalyani, India 741245
Amit Gupta
Ashish Verma - Department of Radiodiagnosis, Banaras Hindu University, Varanasi, India 221005
Dipin Sudhakaran - Department of Radiodiagnosis, Aster Malabar Institute of Medical Sciences, Kerala, India 670621
E. Dhamija - Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India 110029
Himangi Unde - Department of Radiodiagnosis, Tata Main Hospital, Mumbai, India 400012
Ishan Kumar - Department of Radiodiagnosis, Banaras Hindu University, Varanasi, India 221005
K. Rangarajan - Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India 110029
Prerna Garg - Department of Radiodiagnosis, Rajiv Gandhi Cancer Institute and Research Centre, Delhi, India 110085
Rachel Sequeira - Department of Radiodiagnosis, Tata Main Hospital, Mumbai, India 400012
Sudhin Shylendran - Department of Radiodiagnosis, Baby Memorial Hospital, Kerala, India 670621
Taruna Yadav - Department of Radiodiagnosis, All India Institute of Medical Sciences, Jodhpur, India 342005
Tej Pal - Department of Radiodiagnosis, National Cancer Institute, Jhajjar, India 124105
Pankaj Gupta - Postgraduate Institute of Medical Education and Research, Chandigarh