🤖 AI Summary
This study addresses the misalignment between clinical findings—such as location, laterality, and severity—and corresponding anatomical regions in chest X-ray images during generative AI report evaluation. To resolve this, we propose the first anatomy-grounded, multimodal report quality assessment method. Our approach integrates clinical named entity recognition, fine-grained relation extraction, cross-modal phrase-to-image grounding, and multi-source consistency scoring to achieve phrase-level localization of textual descriptions onto anatomical regions in chest radiographs and joint validation. Compared to conventional text-only metrics (e.g., BLEU, BERTScore), our method demonstrates significantly higher correlation with radiologist expert ratings on a standard ground-truth dataset (p < 0.01). It overcomes key limitations of traditional evaluation paradigms by enabling interpretable, anatomy-aware, and empirically verifiable assessment—establishing a novel benchmark for clinically trustworthy AI-assisted diagnostic reporting.
📝 Abstract
Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then performed phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on gold standard datasets.