Semantic Similarity in Radiology Reports via LLMs and NER

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge faced by radiology trainees in identifying semantic discrepancies between preliminary and final radiology reports. We propose Llama-EntScore, a novel method that integrates the global semantic modeling capability of Llama 3.1 with fine-grained clinical entity detection via named entity recognition (NER). A tunable weighting mechanism balances the contributions of the two components, yielding an interpretable, quantitative similarity score. Evaluated against a ground-truth dataset annotated by board-certified radiologists, Llama-EntScore achieves 67% exact-match accuracy and 93% accuracy within a ±1 error margin, outperforming standalone large language model (LLM) and NER baselines. The method improves both accuracy and interpretability in discrepancy detection, supporting adaptive feedback in radiology education and systematic identification of clinical knowledge gaps.

📝 Abstract
Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer junior radiologists valuable guidance in reviewing and refining their reports. We find our method achieves 67% exact-match accuracy and 93% accuracy within ±1 when compared to radiologist-provided ground-truth scores, outperforming both LLMs and NER used independently. Code is available at: [github.com/otmive/llama_reports](https://github.com/otmive/llama_reports)
Problem

Research questions and friction points this paper is trying to address.

Identifying semantic differences between preliminary and final radiology reports
Providing explainable comparisons of radiology reports using LLMs
Developing accurate semantic similarity scoring for radiology report evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines Llama 3.1 with NER for semantic scoring
Uses tunable weights to emphasize specific differences
Generates quantitative scores and interpretable feedback
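The combination described above can be sketched as a weighted blend of an LLM similarity score and an NER entity-overlap score. This is an illustrative sketch only: the function names, the weight `alpha`, and the Jaccard overlap used for the NER component are assumptions for clarity, not the authors' actual implementation.

```python
def entity_overlap(prelim_entities: set, final_entities: set) -> float:
    """Jaccard overlap of clinical entities extracted from the two
    reports (a hypothetical stand-in for the paper's NER component)."""
    if not prelim_entities and not final_entities:
        return 1.0  # two empty reports are trivially identical
    union = prelim_entities | final_entities
    return len(prelim_entities & final_entities) / len(union)


def llama_entscore(llm_score: float, ner_score: float,
                   alpha: float = 0.5) -> float:
    """Blend an LLM-derived similarity score with an NER-based score
    using a tunable weight alpha in [0, 1]; alpha near 1 emphasises
    the LLM's global semantic judgment, alpha near 0 the entity-level
    differences."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * llm_score + (1.0 - alpha) * ner_score
```

For example, with `alpha = 0.5`, an LLM score of 4.0 and an NER score of 2.0 blend to 3.0; tuning `alpha` lets the feedback emphasise or de-emphasise specific kinds of differences, as the Innovation bullets note.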
Beth Pearson
University of Bristol, Bristol, UK
Ahmed Adnan
Software Engineer, Samsung Research
Zahraa Abdallah
University of Bristol, Bristol, UK