AI Summary
To address the scarcity of high-quality domain-specific datasets for identifying imaging follow-up status in radiology reports, this study introduces the first large-scale annotated dataset for this task, comprising 6,393 reports. We systematically evaluate diverse approaches: traditional machine learning models (Logistic Regression, SVM), long-context models (Longformer), fine-tuned open-source large language models (Llama3-8B-Instruct), and proprietary and open-source generative models (GPT-4o, GPT-OSS-20B). A key contribution is a context-aware prompting strategy designed specifically for follow-up detection, which significantly improves generative model inference accuracy. Experimental results show that GPT-4o (Advanced) achieves the highest F1 score of 0.832, closely followed by GPT-OSS-20B at 0.828, with both approaching inter-annotator agreement levels. Traditional models also demonstrate strong baseline performance. This work establishes a new benchmark and methodological paradigm for temporal decision modeling in clinical text.
Abstract
Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, logistic regression (LR) and support vector machines (SVM), as well as the long-context Longformer and a fully fine-tuned Llama3-8B-Instruct, against recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores, with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
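The abstract reports F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. A minimal sketch of the percentile-bootstrap approach is below; the function names and resampling parameters are illustrative assumptions, not details taken from the paper.

```python
import random

def f1_score(y_true, y_pred):
    # Binary F1: harmonic mean of precision and recall over
    # true-positive / false-positive / false-negative counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample report-level (label, prediction)
    # pairs with replacement, recompute F1 each time, and take the
    # empirical alpha/2 and 1-alpha/2 quantiles as the interval.
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return f1_score(y_true, y_pred), (lo, hi)
```

In practice the same resampling loop would also be applied to precision and recall, so that all three reported metrics carry comparable intervals.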