Generative Large Language Models Trained for Detecting Errors in Radiology Reports

📅 2025-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses four clinically critical semantic error types in radiology reports (negation, laterality, temporal progression, and transcription) by constructing a dual-source annotated dataset that combines GPT-4-synthesized erroneous samples with real MIMIC-CXR reports. The authors propose an LLM-based automated error detection framework and show that Llama-3-70B-Instruct achieves strong zero-shot performance on clinical semantic error identification; supervised fine-tuning further improves accuracy, yielding an overall F1 score of 0.780. In a double-blind review of 200 flagged instances, radiologists confirmed 163 as genuine errors (by at least one reader), an 81.5% clinical acceptance rate. This work establishes an LLM-driven paradigm for radiology report quality control and provides a reproducible, verifiable methodological foundation for automated clinical text quality assurance.

📝 Abstract
In this retrospective study, a dataset was constructed in two parts. The first part comprised 1,656 synthetic chest radiology reports generated by GPT-4 with specified prompts: 828 error-free reports and 828 containing errors. The second part comprised 614 reports: 307 error-free reports (2011–2016) from the MIMIC-CXR database and 307 corresponding error-containing synthetic reports generated by GPT-4 from these MIMIC-CXR reports with specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Several models, including Llama-3, GPT-4, and BiomedBERT, were then adapted using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Model performance was evaluated on the constructed dataset using F1 scores, 95% confidence intervals (CIs), and paired-sample t-tests, with the prediction results further assessed by radiologists. The fine-tuned Llama-3-70B-Instruct model achieved the best performance, with F1 scores of 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports in which the model had flagged errors; both radiologists confirmed the model-detected errors in 99 reports, and at least one radiologist confirmed them in 163. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, substantially improved error detection in radiology reports.
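The abstract reports a per-category F1 score plus an overall figure. As a minimal sketch of that evaluation scheme (not the authors' code, and with illustrative counts rather than the study's data), each report can be scored as a true positive, false positive, or false negative per error category, with an overall F1 obtained by micro-averaging the pooled counts:

```python
# Minimal sketch of per-category F1 evaluation for report error
# detection. Category names follow the paper; all counts below are
# illustrative placeholders, not the study's results.
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # model flagged an error that annotators confirmed
    fp: int  # model flagged an error that was not present
    fn: int  # model missed a genuine error


def f1(c: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * c.tp + c.fp + c.fn
    return (2 * c.tp / denom) if denom else 0.0


def micro_f1(per_category: dict) -> float:
    """Pool TP/FP/FN across categories, then compute a single F1."""
    total = Counts(
        tp=sum(c.tp for c in per_category.values()),
        fp=sum(c.fp for c in per_category.values()),
        fn=sum(c.fn for c in per_category.values()),
    )
    return f1(total)


if __name__ == "__main__":
    results = {
        "negation":        Counts(tp=80, fp=20, fn=25),
        "left/right":      Counts(tp=75, fp=18, fn=24),
        "interval change": Counts(tp=70, fp=22, fn=25),
        "transcription":   Counts(tp=85, fp=15, fn=20),
    }
    for name, counts in results.items():
        print(f"{name:>15}: F1 = {f1(counts):.3f}")
    print(f"{'overall':>15}: F1 = {micro_f1(results):.3f}")
```

Micro-averaging weights each flagged instance equally; a macro average (mean of the four category F1 scores) would instead weight each category equally.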
Problem

Research questions and friction points this paper is trying to address.

Detecting errors in radiology reports using generative LLMs
Evaluating model performance on synthetic and real-world datasets
Categorizing and detecting specific types of radiology report errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4 generates synthetic radiology reports with errors
Fine-tuned Llama-3-70B-Instruct detects errors best
Strong zero-shot error detection even before fine-tuning
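The zero-shot approach in the bullets above amounts to handing the model a report and the four error categories in a single instruction. The template below is a hypothetical sketch of that setup; the paper's actual prompt wording is not reproduced on this page, so both the template text and the `build_prompt` helper are assumptions:

```python
# Hypothetical zero-shot prompt builder for radiology report error
# detection. Template wording is illustrative, not the paper's prompt.
ERROR_TYPES = ("negation", "left/right", "interval change", "transcription")

PROMPT_TEMPLATE = """You are a radiology report quality-control assistant.
Check the report below for the following error types: {types}.
For each error found, reply with one line in the form
<error type>: <quoted sentence>. If the report is error-free, reply NONE.

Report:
{report}
"""


def build_prompt(report: str) -> str:
    """Fill the template with the four error categories and the report text."""
    return PROMPT_TEMPLATE.format(types=", ".join(ERROR_TYPES), report=report)


if __name__ == "__main__":
    example = ("No evidence of pneumothorax. "
               "There is an opacity in the left lower lobe, unchanged from prior.")
    print(build_prompt(example))
```

Constraining the output to a fixed line format makes the model's flags easy to parse when comparing predictions against annotations.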
Cong Sun
Department of Population Health Science, Weill Cornell Medicine, New York, NY
Kurt Teichman
Department of Radiology, Weill Cornell Medicine, New York, NY
Yiliang Zhou
University of California, Irvine
NLP · AI in healthcare · LLM
Brian Critelli
Department of Radiology, Weill Cornell Medicine, New York, NY
David Nauheim
Department of Radiology, Weill Cornell Medicine, New York, NY
Graham Keir
Department of Radiology, Weill Cornell Medicine, New York, NY
Xindi Wang
Assistant Professor, Shandong University
Natural Language Processing · AI4Healthcare · Clinical NLP · BioNLP · Trustworthy AI
Judy Zhong
Department of Population Health Science, Weill Cornell Medicine, New York, NY
Adam E Flanders
Department of Radiology, Thomas Jefferson University, Philadelphia, PA
George Shih
Department of Radiology, Weill Cornell Medicine, New York, NY
Yifan Peng
Department of Population Health Science, Weill Cornell Medicine, New York, NY