🤖 AI Summary
This work addresses the evaluation of high-risk clinical errors, such as missed findings, hallucinations, and polarity reversals, in automatically generated radiology reports. The authors propose RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of report quality. The method decomposes reference and candidate reports into structured clinical evidence units, aligns them with entropy-regularized optimal transport, and feeds clinically relevant discrepancies into a monotone risk model; all transport, feature, and readout choices are selected on ReXVal. This enables auditable, ranking-oriented evaluation without requiring fine-tuning of large language models. On the independent RadEvalX dataset, the frozen system achieves a Spearman correlation of 0.715 with total annotated error burden, a higher point estimate than standard metrics and the GREEN-radllama2-7B evaluator. In the ReXErr-v1 corruption stress test, it attains an AUROC of 0.768 and a paired win rate of 0.990.
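The gap between the two stress-test numbers is easier to read with the metrics written out. The following is a minimal sketch, assuming the win rate compares each corrupted report with its own clean counterpart while AUROC compares every corrupted score with every clean score; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
def paired_win_rate(clean_scores, corrupted_scores):
    """Fraction of (clean, corrupted) pairs where the corrupted report gets the higher risk score."""
    wins = sum(corrupted > clean for clean, corrupted in zip(clean_scores, corrupted_scores))
    return wins / len(clean_scores)


def auroc(negative_scores, positive_scores):
    """Probability that a random positive outscores a random negative (ties count as half)."""
    total = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                total += 1.0
            elif pos == neg:
                total += 0.5
    return total / (len(positive_scores) * len(negative_scores))


# Hypothetical risk scores: index i in both lists refers to the same underlying report.
clean = [0.10, 0.40, 0.20, 0.70]
corrupted = [0.30, 0.60, 0.50, 0.90]

win_rate = paired_win_rate(clean, corrupted)  # within-pair comparison
pooled_auroc = auroc(clean, corrupted)        # all-vs-all comparison
```

In this toy data every corrupted report outscores its own clean version, yet some corrupted reports still score below unrelated clean ones, so the paired win rate is higher than the pooled AUROC. That is one plausible reason a metric can reach a very high paired win rate with a more moderate AUROC.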
📝 Abstract
Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
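To make the alignment step concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between two small sets of evidence units. It is an illustration under stated assumptions, not the authors' implementation: the unit schema (`finding`, `polarity`, `location`), the cost function, the uniform masses, and the value of `epsilon` are all hypothetical choices.

```python
import math

ATTRIBUTES = ("polarity", "location")  # hypothetical attribute schema


def unit_cost(ref, cand):
    """Hypothetical cost: 1.0 for different findings, else 0.25 per mismatched attribute."""
    if ref["finding"] != cand["finding"]:
        return 1.0
    return 0.25 * sum(ref[k] != cand[k] for k in ATTRIBUTES)


def sinkhorn_plan(cost, a, b, epsilon=0.3, n_iters=200):
    """Entropy-regularized transport plan between mass vectors a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / epsilon)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


reference = [
    {"finding": "pleural effusion", "polarity": "present", "location": "left"},
    {"finding": "pneumothorax", "polarity": "absent", "location": "none"},
]
candidate = [
    {"finding": "pneumothorax", "polarity": "absent", "location": "none"},
    {"finding": "pleural effusion", "polarity": "absent", "location": "left"},
]

cost = [[unit_cost(r, c) for c in candidate] for r in reference]
masses_ref = [1.0 / len(reference)] * len(reference)    # assumed uniform masses
masses_cand = [1.0 / len(candidate)] * len(candidate)

plan = sinkhorn_plan(cost, masses_ref, masses_cand)
transport_cost = sum(
    plan[i][j] * cost[i][j] for i in range(len(reference)) for j in range(len(candidate))
)

# Read off attribute discrepancies on the strongest match for each reference unit.
discrepancies = {}
for i, ref in enumerate(reference):
    j = max(range(len(candidate)), key=lambda col: plan[i][col])
    discrepancies[ref["finding"]] = [k for k in ATTRIBUTES if ref[k] != candidate[j][k]]
```

Here the plan places most of each reference unit's mass on the candidate unit with the same finding, and the per-pair attribute comparison flags the polarity reversal on the pleural effusion. Discrepancies of this kind are what the abstract describes as inputs to the monotone risk model; how the paper actually encodes and weights them is not specified here.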