A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization using Eye-tracking Data

📅 2025-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether textual information can improve both accuracy and clinical interpretability in localizing abnormalities in chest X-rays. We introduce the first report-sentence-to-image-region alignment benchmark, automatically constructed from real radiologists’ eye-tracking data. We systematically compare cross-modal phrase grounding (using an enhanced MDETR) with conventional class-level object detection (Faster R-CNN). Results show that phrase grounding outperforms class-level detection in localization accuracy (mIoU = 0.36 vs. 0.20) and clinical interpretability (Containment Ratio = 0.48 vs. 0.26). Our key contribution is the first integration of eye-tracking into medical visual grounding evaluation, which provides empirical evidence that fine-grained text guidance helps bridge the semantic gap between image understanding and clinical decision-making. This work offers a basis for interpretable AI-assisted diagnosis in radiology.

📝 Abstract

Chest diseases rank among the most prevalent and dangerous global health issues. Deep learning models for object detection and phrase grounding interpret complex radiology data to assist healthcare professionals in diagnosis. Object detection localizes abnormalities for predefined classes, while phrase grounding localizes abnormalities for textual descriptions. This paper investigates how text enhances abnormality localization in chest X-rays by comparing the performance and explainability of these two tasks. To establish an explainability baseline, we propose an automatic pipeline that generates image regions for report sentences from radiologists' eye-tracking data. The phrase grounding model achieves better performance (mIoU = 0.36 vs. 0.20) and better explainability (containment ratio = 0.48 vs. 0.26), which suggests that text is effective in enhancing chest X-ray abnormality localization.
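The two reported metrics can be illustrated with a minimal Python sketch for axis-aligned bounding boxes. The IoU computation is standard. The containment ratio is not defined in this summary, so the version below is an assumption: the fraction of the predicted box that lies inside the eye-tracking-derived region. The function names and the `(x_min, y_min, x_max, y_max)` box layout are also illustrative choices, not taken from the paper.

```python
from typing import Sequence, Tuple

# Box layout is an assumption: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if they are disjoint)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of a predicted box and a reference box."""
    inter = intersection_area(pred, truth)
    union = area(pred) + area(truth) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Average IoU over (prediction, reference) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs) if pairs else 0.0


def containment_ratio(pred: Box, gaze_region: Box) -> float:
    """ASSUMED definition: share of the predicted box inside the gaze region."""
    pred_area = area(pred)
    return intersection_area(pred, gaze_region) / pred_area if pred_area > 0 else 0.0
```

Under this assumed definition, a prediction can score a high containment ratio with a low IoU when it is a small box inside a much larger gaze region, which is one reason to report the two numbers side by side.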

Problem

Research questions and friction points this paper is trying to address.

Compares object detection and phrase grounding models in chest X-ray analysis.

Investigates text's role in improving abnormality localization accuracy.

Uses radiologists' eye-tracking data to establish an explainability baseline for both models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phrase grounding model enhances X-ray abnormality localization.

Eye-tracking data generates image regions for report sentences.

Text improves performance and explainability in radiology diagnostics.
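One way to picture how eye-tracking data could yield an image region for a report sentence is sketched below. This summary does not describe the actual pipeline, so everything here is an assumption: the `Fixation` record, the idea of selecting fixations that overlap the time window in which a sentence was dictated, and the choice of a simple enclosing box are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Box layout is an assumption: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Fixation:
    """Hypothetical gaze record: image position plus start/end time in seconds."""
    x: float
    y: float
    start: float
    end: float


def region_for_sentence(
    fixations: List[Fixation],
    t_start: float,
    t_end: float,
    margin: float = 0.0,
) -> Optional[Box]:
    """Enclosing box of all fixations that overlap the sentence's time window.

    ASSUMPTION: a sentence is associated with the gaze recorded while it was
    dictated. Returns None when no fixation overlaps the window.
    """
    hits = [f for f in fixations if f.start < t_end and f.end > t_start]
    if not hits:
        return None
    xs = [f.x for f in hits]
    ys = [f.y for f in hits]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```

A region produced this way could then serve as the reference against which a model's predicted box is scored; a real pipeline would likely also need to filter outlier fixations and account for the delay between looking and speaking.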

🔎 Similar Papers

Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis