🤖 AI Summary
Radiology reports are prone to transcription errors, terminology inconsistencies, and template mismatches due to high clinical workload and domain-specific linguistic complexity, which compromises diagnostic safety. To address this, we propose a three-stage (detection–localization–correction) automated proofreading framework that emulates expert radiologist review. Our approach introduces a novel dual-knowledge injection mechanism: (1) medical knowledge graph distillation and (2) external knowledge retrieval. We further design a multi-stage decoupled proofreading paradigm tailored to real-world error patterns, integrating knowledge distillation, graph-based reasoning, retrieval-augmented generation, and sequential LLM fine-tuning. Evaluated on a real-world radiology error benchmark, our method improves error detection accuracy by up to 31.56% and reduces processing latency by 37.4%. Clinical evaluation by board-certified radiologists confirms its superior clinical relevance and factual consistency over existing approaches.
📝 Abstract
The increasing complexity and workload of clinical radiology lead to inevitable oversights and mistakes in radiology reports, which serve as diagnostic tools, causing delayed treatment and sometimes life-threatening harm to patients. While large language models (LLMs) have shown remarkable progress in many tasks, their utility in detecting and correcting errors in radiology reporting remains limited. This paper proposes a novel dual-knowledge infusion framework that enhances LLMs' capability for radiology report proofreading through systematic integration of medical expertise. Specifically, the knowledge infusion combines medical knowledge graph distillation (MKGD) with external knowledge retrieval (EXKR), enabling an effective automated approach to tackling mistakes in radiology reporting. By decomposing the complex proofreading task into three specialized stages of detection, localization, and correction, our method mirrors the systematic review process employed by expert radiologists, ensuring both precision and clinical interpretability. To enable a robust, clinically relevant evaluation, we also propose a comprehensive benchmark built from real-world radiology reports with real-world error patterns, including speech recognition confusions, terminology ambiguities, and template-related inconsistencies. Extensive evaluations across multiple LLM architectures demonstrate substantial improvements from our approach: up to a 31.56% increase in error detection accuracy and a 37.4% reduction in processing time. Human evaluation by radiologists confirms superior clinical relevance and factual consistency compared to existing approaches.
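The three-stage decomposition described above (detection, localization, correction) can be illustrated with a minimal sketch. This is not the paper's implementation: the `CONFUSIONS` lexicon, the `ProofreadResult` type, and the functions `detect`, `localize`, `correct`, and `proofread` are all hypothetical names, and simple string matching stands in for the knowledge-infused LLM that each stage would use in practice. The sketch only shows how the stages might be decoupled so that later stages run only when earlier ones flag a report.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy lexicon of speech-recognition confusions; in the paper's
# setting this role would be played by an LLM with injected medical knowledge.
CONFUSIONS = {"plural effusion": "pleural effusion"}


@dataclass
class ProofreadResult:
    has_error: bool
    error_span: Optional[Tuple[int, int]]  # character offsets, end exclusive
    corrected: str


def detect(report: str) -> bool:
    """Stage 1: decide whether the report contains any known error."""
    lowered = report.lower()
    return any(wrong in lowered for wrong in CONFUSIONS)


def localize(report: str) -> Optional[Tuple[int, int]]:
    """Stage 2: return the character span of the first known error."""
    lowered = report.lower()
    for wrong in CONFUSIONS:
        start = lowered.find(wrong)
        if start != -1:
            return (start, start + len(wrong))
    return None


def correct(report: str, span: Tuple[int, int]) -> str:
    """Stage 3: replace the localized span with its correction."""
    start, end = span
    wrong = report[start:end].lower()
    return report[:start] + CONFUSIONS[wrong] + report[end:]


def proofread(report: str) -> ProofreadResult:
    """Run the stages in sequence, stopping early on clean reports."""
    if not detect(report):
        return ProofreadResult(False, None, report)
    span = localize(report)
    if span is None:
        return ProofreadResult(True, None, report)
    return ProofreadResult(True, span, correct(report, span))


if __name__ == "__main__":
    print(proofread("Small plural effusion on the left."))
```

Keeping the stages as separate functions means a clean report exits after the cheap detection step, and each stage can be evaluated or swapped out on its own.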