Generative Large Language Models Trained for Detecting Errors in Radiology Reports

📅 2025-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses four clinically critical semantic error types in radiology reports (negation, laterality, temporal progression, and transcription) by constructing a dual-source annotated dataset that combines GPT-4-synthesized erroneous samples with real MIMIC-CXR reports. The authors propose an LLM-based automated error detection framework and show that Llama-3-70B-Instruct achieves strong zero-shot performance on clinical semantic error identification; supervised fine-tuning further improves accuracy, yielding an overall F1 score of 0.780. In a double-blind review of 200 flagged instances, radiologists confirmed 163 as genuine errors (by at least one reader), an 81.5% clinical acceptance rate. This work establishes an LLM-driven paradigm for radiology report quality control and provides a reproducible, verifiable methodological foundation for automated clinical text quality assurance.

📝 Abstract
In this retrospective study, a dataset was constructed in two parts. The first part comprised 1,656 synthetic chest radiology reports generated by GPT-4 with specified prompts: 828 error-free reports and 828 containing errors. The second part comprised 614 reports: 307 error-free reports (2011–2016) from the MIMIC-CXR database and 307 corresponding error-containing synthetic reports generated by GPT-4 from these MIMIC-CXR reports with specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Several models, including Llama-3, GPT-4, and BiomedBERT, were then adapted using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Model performance was evaluated on the constructed dataset using F1 scores, 95% confidence intervals (CIs), and paired-sample t-tests, with the prediction results further assessed by radiologists. The fine-tuned Llama-3-70B-Instruct model achieved the best performance, with F1 scores of 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports in which the model had flagged errors; both radiologists confirmed the model-detected errors in 99 reports, and at least one radiologist confirmed them in 163. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, substantially improved error detection in radiology reports.
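The abstract reports a per-category F1 score plus an overall figure. As a minimal sketch of that evaluation scheme (not the authors' code, and with illustrative counts rather than the study's data), each report can be scored as a true positive, false positive, or false negative per error category, with an overall F1 obtained by micro-averaging the pooled counts:

```python
# Minimal sketch of per-category F1 evaluation for report error
# detection. Category names follow the paper; all counts below are
# illustrative placeholders, not the study's results.
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # model flagged an error that annotators confirmed
    fp: int  # model flagged an error that was not present
    fn: int  # model missed a genuine error


def f1(c: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * c.tp + c.fp + c.fn
    return (2 * c.tp / denom) if denom else 0.0


def micro_f1(per_category: dict) -> float:
    """Pool TP/FP/FN across categories, then compute a single F1."""
    total = Counts(
        tp=sum(c.tp for c in per_category.values()),
        fp=sum(c.fp for c in per_category.values()),
        fn=sum(c.fn for c in per_category.values()),
    )
    return f1(total)


if __name__ == "__main__":
    results = {
        "negation":        Counts(tp=80, fp=20, fn=25),
        "left/right":      Counts(tp=75, fp=18, fn=24),
        "interval change": Counts(tp=70, fp=22, fn=25),
        "transcription":   Counts(tp=85, fp=15, fn=20),
    }
    for name, counts in results.items():
        print(f"{name:>15}: F1 = {f1(counts):.3f}")
    print(f"{'overall':>15}: F1 = {micro_f1(results):.3f}")
```

Micro-averaging weights each flagged instance equally; a macro average (mean of the four category F1 scores) would instead weight each category equally.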
Problem

Research questions and friction points this paper is trying to address.

Detecting errors in radiology reports using generative LLMs
Evaluating model performance on synthetic and real-world datasets
Categorizing and detecting specific types of radiology report errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4 generates synthetic radiology reports with errors
Fine-tuned Llama-3-70B-Instruct detects errors best
Strong zero-shot error detection even before fine-tuning
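The zero-shot approach in the bullets above amounts to handing the model a report and the four error categories in a single instruction. The template below is a hypothetical sketch of that setup; the paper's actual prompt wording is not reproduced on this page, so both the template text and the `build_prompt` helper are assumptions:

```python
# Hypothetical zero-shot prompt builder for radiology report error
# detection. Template wording is illustrative, not the paper's prompt.
ERROR_TYPES = ("negation", "left/right", "interval change", "transcription")

PROMPT_TEMPLATE = """You are a radiology report quality-control assistant.
Check the report below for the following error types: {types}.
For each error found, reply with one line in the form
<error type>: <quoted sentence>. If the report is error-free, reply NONE.

Report:
{report}
"""


def build_prompt(report: str) -> str:
    """Fill the template with the four error categories and the report text."""
    return PROMPT_TEMPLATE.format(types=", ".join(ERROR_TYPES), report=report)


if __name__ == "__main__":
    example = ("No evidence of pneumothorax. "
               "There is an opacity in the left lower lobe, unchanged from prior.")
    print(build_prompt(example))
```

Constraining the output to a fixed line format makes the model's flags easy to parse when comparing predictions against annotations.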
Cong Sun
Department of Population Health Science, Weill Cornell Medicine, New York, NY
Kurt Teichman
Department of Radiology, Weill Cornell Medicine, New York, NY
Yiliang Zhou
University of California, Irvine
NLP · AI in healthcare · LLM
Brian Critelli
Department of Radiology, Weill Cornell Medicine, New York, NY
David Nauheim
Department of Radiology, Weill Cornell Medicine, New York, NY
Graham Keir
Department of Radiology, Weill Cornell Medicine, New York, NY
Xindi Wang
Assistant Professor, Shandong University
Natural Language Processing · AI4Healthcare · Clinical NLP · BioNLP · Trustworthy AI
Judy Zhong
Department of Population Health Science, Weill Cornell Medicine, New York, NY
Adam E Flanders
Department of Radiology, Thomas Jefferson University, Philadelphia, PA
George Shih
Department of Radiology, Weill Cornell Medicine, New York, NY
Yifan Peng
Department of Population Health Science, Weill Cornell Medicine, New York, NY