🤖 AI Summary
Medical phrase grounding (MPG) suffers from limited generalizability and poor zero-shot capability due to the scarcity of high-quality annotated data. To address this, we propose Anatomical Grounding Pre-training (AGP), the first anatomy-aware pre-training paradigm specifically designed for MPG. AGP leverages large-scale anatomical annotation datasets (e.g., Chest ImaGenome) and employs contrastive learning to jointly model radiology reports and image-based anatomical regions, enabling fine-grained alignment between textual phrases and anatomical structures. Crucially, AGP requires no task-specific annotations for pre-training. On the MS-CXR benchmark, it achieves state-of-the-art zero-shot localization performance; after fine-tuning, it attains an mIoU of 61.2, establishing a new state of the art. Our core contribution is an anatomy-informed multimodal pre-training objective that bridges linguistic descriptions and anatomical image regions, improving transferability and data efficiency in MPG.
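The contrastive alignment between phrases and anatomical regions can be illustrated with a minimal sketch. This is a generic symmetric InfoNCE objective over matched phrase-region embedding pairs, not the paper's exact loss; the function names, temperature value, and NumPy implementation are illustrative assumptions.

```python
import numpy as np

def info_nce(phrase_emb, region_emb, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th phrase should match the i-th region.

    Hypothetical sketch of a contrastive phrase-region alignment objective;
    the paper's actual loss and temperature may differ.
    """
    # L2-normalize both sets of embeddings
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    logits = p @ r.T / temperature  # (N, N) cosine-similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    # average phrase->region and region->phrase directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly matched pairs yield a low loss, while permuted (misaligned) pairs yield a higher one, which is what drives the phrase and region encoders toward a shared embedding space.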
📄 Abstract
Medical Phrase Grounding (MPG) maps radiological findings described in medical reports to specific regions in medical images. The primary obstacle hindering progress in MPG is the scarcity of annotated data available for training and validation. We propose anatomical grounding as an in-domain pre-training task that aligns anatomical terms with their corresponding regions in medical images, leveraging large-scale datasets such as Chest ImaGenome. Our empirical evaluation on MS-CXR demonstrates that anatomical grounding pre-training significantly improves performance in both zero-shot and fine-tuning settings, outperforming state-of-the-art MPG models. Our fine-tuned model achieves state-of-the-art performance on MS-CXR with an mIoU of 61.2, demonstrating the effectiveness of anatomical grounding pre-training for MPG.
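The mIoU metric used above averages the intersection-over-union between each predicted bounding box and its ground-truth box. A minimal sketch (the `(x1, y1, x2, y2)` box convention and function names are assumptions, not from the paper):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def miou(preds, gts):
    """Mean IoU over paired predicted and ground-truth boxes."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, boxes `(0, 0, 2, 2)` and `(1, 1, 3, 3)` overlap in a unit square out of a union of 7, giving an IoU of 1/7; an mIoU of 61.2 corresponds to an average overlap of 0.612 across the benchmark's phrase-box pairs.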