ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation

📅 2024-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Radiology report generation (R2Gen) evaluation has long suffered from rigid lexical matching and unidimensional scoring, resulting in poor correlation with human judgments. To address this, we propose the first explainable, fine-grained, and user-customizable reward-based automatic evaluation framework. Our method leverages GPT-4 to synthesize high-quality paired data and trains an LLM-based reward model guided by a dual-scoring mechanism. Crucially, we introduce a margin-based reward constraint loss that enables dynamic, modular specification of clinical dimensions—including pathology, anatomical structures, and linguistic quality—and outputs both an overall score and interpretable sub-scores. Experiments demonstrate a Pearson correlation of 0.82 with expert human ratings—significantly surpassing conventional metrics (e.g., BLEU, CIDEr)—and confirm strong cross-evaluation generalizability.

📝 Abstract
Automated radiology report generation (R2Gen) has advanced significantly, introducing challenges in accurate evaluation due to its complexity. Traditional metrics often fall short by relying on rigid word-matching or focusing only on pathological entities, leading to inconsistencies with human assessments. To bridge this gap, we introduce ReFINE, an automatic evaluation metric designed specifically for R2Gen. Our metric utilizes a reward model, guided by our margin-based reward enforcement loss, along with a tailored training data design that enables customization of evaluation criteria to suit user-defined needs. It not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to adjust the criteria across different aspects of reports. Leveraging GPT-4, we designed an easy-to-use data generation pipeline that produces extensive training data under two distinct scoring systems, each containing reports of varying quality along with corresponding scores. These GPT-generated reports are then paired as accepted and rejected samples through our pairing rule, and used to train an LLM as our fine-grained reward model, which assigns higher rewards to higher-quality reports. Our reward-control loss enables this model to simultaneously output multiple individual rewards, one per evaluation criterion, with their summation yielding the final ReFINE score. Our experiments demonstrate ReFINE's heightened correlation with human judgments and superior performance in model selection compared to traditional metrics. Notably, our model provides both an overall score and individual scores for each evaluation item, enhancing interpretability. We also demonstrate its flexible training across various evaluation systems.
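The abstract does not spell out the margin-based reward enforcement loss; a minimal sketch of a generic pairwise margin loss over per-criterion sub-rewards, with hypothetical function and parameter names, might look like:

```python
def margin_reward_loss(sub_rewards_accepted, sub_rewards_rejected, margin=1.0):
    """Hinge-style pairwise margin loss (illustrative, not the paper's exact loss).

    sub_rewards_accepted / sub_rewards_rejected: per-criterion rewards
    (e.g. pathology, anatomy, linguistic quality) for one report pair.
    The overall score is the sum of sub-rewards; the loss pushes the
    accepted report's total at least `margin` above the rejected one.
    """
    total_accepted = sum(sub_rewards_accepted)
    total_rejected = sum(sub_rewards_rejected)
    # Zero loss once the reward gap meets the margin; linear penalty otherwise.
    return max(0.0, margin - (total_accepted - total_rejected))
```

In this sketch the sub-rewards stay individually inspectable (the interpretability property the paper claims), while only their sum enters the margin constraint.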
Problem

Research questions and friction points this paper is trying to address.

Automated radiology report evaluation challenges
Traditional metrics correlate poorly with human assessments
Existing metrics lack interpretability and user customization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-based evaluation framework
GPT-4 data generation pipeline
Margin-based reward enforcement loss
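The pairing rule that turns GPT-4-scored reports into accepted/rejected training samples is not detailed here; one plausible sketch, assuming each report carries an overall score and that pairs are formed only when scores differ enough, is:

```python
from itertools import combinations

def build_preference_pairs(scored_reports, min_gap=1.0):
    """Pair scored reports into (accepted, rejected) samples.

    scored_reports: list of (report_text, overall_score) tuples.
    Hypothetical rule: any two reports whose score gap is at least
    `min_gap` form a pair, with the higher-scored report as 'accepted'.
    """
    pairs = []
    for (r1, s1), (r2, s2) in combinations(scored_reports, 2):
        if abs(s1 - s2) >= min_gap:
            accepted, rejected = (r1, r2) if s1 > s2 else (r2, r1)
            pairs.append((accepted, rejected))
    return pairs
```

A minimum score gap keeps near-ties out of training, so the reward model only learns from pairs with an unambiguous quality ordering.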
Yunyi Liu
The University of Sydney
LLM, VQA, Visual Grounding, Report Generation, Medical Image
Yingshu Li
The University of Sydney, Sydney, NSW, Australia
Zhanyu Wang
The University of Sydney, Sydney, NSW, Australia
Xinyu Liang
Guangzhou University of Chinese Medicine, Guangzhou, China
Lingqiao Liu
Associate Professor at the University of Adelaide
Computer Vision, Machine Learning
Lei Wang
The University of Wollongong, Wollongong, NSW, Australia
Luping Zhou
School of Electrical and Computer Engineering, University of Sydney
Medical Imaging, Computer Vision, Machine Learning