Generalized Medical Phrase Grounding

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Medical phrase grounding (MPG) traditionally assumes that each sentence maps to exactly one image region. That assumption fails on clinically common cases: multi-region findings, non-diagnostic descriptions, and phrases that cannot be grounded to a bounding box at all, such as negations and statements of normal anatomy. To address this, we propose *generalized MPG*, which maps each sentence to zero, one, or multiple regions and so relaxes the single-box constraint. Our method uses a two-stage training paradigm: first, cross-modal alignment is pre-trained on sentence–anatomical region pairs; second, the model is fine-tuned on human-annotated bounding boxes. We further introduce a learnable scoring mechanism to improve grounding robustness. The approach can also be combined with existing report generation models without retraining them. Evaluated on PadChest-GR and MS-CXR, it outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, shows strong zero-shot transfer, and uses far fewer human bounding box annotations.

📝 Abstract

Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopt a two-stage training regime: pre-training on datasets that align report sentences with anatomy boxes, then fine-tuning on datasets that pair report sentences with human-annotated boxes. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
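To make the "zero, one, or multiple scored regions" formulation concrete, here is a minimal Python sketch of what such an output contract could look like. It is not MedGrounder's implementation: the `ScoredBox` structure, the `select_regions` helper, the 0.5 threshold, and all candidate boxes are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredBox:
    """A candidate region with a confidence score (hypothetical structure)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float


def select_regions(candidates: List[ScoredBox], threshold: float = 0.5) -> List[ScoredBox]:
    """Keep candidates scoring at or above the threshold, highest score first.

    An empty result means the sentence is treated as non-groundable.
    The threshold value is an illustrative assumption, not taken from the paper.
    """
    kept = [box for box in candidates if box.score >= threshold]
    return sorted(kept, key=lambda box: box.score, reverse=True)


# Made-up candidates for three kinds of report sentences.
negation = [ScoredBox(0.1, 0.1, 0.4, 0.4, 0.05)]
single = [ScoredBox(0.2, 0.3, 0.5, 0.6, 0.91), ScoredBox(0.6, 0.6, 0.9, 0.9, 0.12)]
bilateral = [ScoredBox(0.1, 0.5, 0.4, 0.9, 0.88), ScoredBox(0.6, 0.5, 0.9, 0.9, 0.84)]

for name, cands in [("negation", negation), ("single", single), ("bilateral", bilateral)]:
    print(name, len(select_regions(cands)))
```

Under these assumptions, the same selection step yields an empty list for the negation, one box for the single finding, and two boxes for the bilateral finding, which is the behaviour a one-box-per-phrase REC model cannot express.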

Problem

Research questions and friction points this paper is trying to address.

Reformulate medical phrase grounding to handle zero, one, or multiple image regions per sentence

Address multi-region findings, non-diagnostic text, and non-groundable phrases in radiology reports

Develop a model requiring fewer human annotations while improving accuracy on complex phrases

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized phrase grounding with zero-to-multiple scored regions

Two-stage training: pre-training on anatomy alignment then fine-tuning

Composition with existing generators for grounded reports without retraining
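The composition idea can be sketched as a post-processing step: the report generator produces text as usual, and a separate grounder is applied to each sentence afterwards. The Python sketch below only illustrates that wiring. The names `split_sentences`, `ground_report`, and `stub_grounder`, the naive sentence splitter, and the keyword rules inside the stub are all hypothetical stand-ins, not the paper's interface or model.

```python
import re
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
ScoredRegion = Tuple[Box, float]          # box plus confidence score
Grounder = Callable[[object, str], List[ScoredRegion]]


def split_sentences(report: str) -> List[str]:
    """Naive split after sentence-ending punctuation (an assumption for this sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def ground_report(image: object, report: str, grounder: Grounder) -> List[Dict]:
    """Attach zero or more scored regions to each sentence of a generated report.

    The generator that produced `report` is never touched, so no retraining is involved.
    """
    return [{"sentence": s, "regions": grounder(image, s)} for s in split_sentences(report)]


def stub_grounder(image: object, sentence: str) -> List[ScoredRegion]:
    """Stand-in for a trained grounder; returns made-up boxes based on keywords."""
    text = sentence.lower()
    if text.startswith("no ") or "normal" in text:
        return []  # negations and normal anatomy: nothing to ground
    if "bilateral" in text:
        return [((0.1, 0.5, 0.4, 0.9), 0.88), ((0.6, 0.5, 0.9, 0.9), 0.84)]
    return [((0.2, 0.3, 0.5, 0.6), 0.90)]


# Pretend this text came from any off-the-shelf report generator.
generated = "Bilateral pleural effusions. No pneumothorax. Cardiomegaly is present."
grounded = ground_report(None, generated, stub_grounder)
for item in grounded:
    print(item["sentence"], len(item["regions"]))
```

Keeping the grounder behind a plain callable means the generator and the grounder can be swapped independently, which is the property the composition claim relies on.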

🔎 Similar Papers

MedRG: Medical Report Grounding with Multi-modal Large Language Model