🤖 AI Summary
Existing medical visual grounding benchmarks are confined to single-image settings, failing to support clinically critical tasks such as longitudinal and cross-modal lesion tracking and progression analysis, which require fine-grained semantic alignment and context-aware reasoning. To address this gap, we introduce MedSG-Bench, the first visual grounding benchmark dedicated to medical image sequences. It comprises eight VQA-style tasks organized into two paradigms: Image Difference Grounding (detecting change regions across images) and Image Consistency Grounding (localizing consistent or shared semantics across a sequence). The benchmark spans 10 imaging modalities and 76 public datasets, totaling 9,630 question-answer pairs. Evaluating both general-purpose and medical-domain MLLMs on it reveals that even advanced models exhibit substantial limitations on sequential grounding. To advance the field, we additionally release MedSG-188K, a large-scale instruction-tuning dataset for sequential visual grounding, and MedSeq-Grounder, an MLLM specialized for fine-grained understanding across medical image sequences. The benchmark, dataset, and model are all publicly released.
📝 Abstract
Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in the medical imaging domain. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks organized into two grounding paradigms: 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detecting consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench.
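To make the two paradigms concrete, here is a minimal sketch of what a VQA-style QA pair for each might look like. All field names, file paths, and the `[x1, y1, x2, y2]` box convention are illustrative assumptions, not the released MedSG-Bench schema.

```python
# Hypothetical sketch of MedSG-Bench-style QA pairs; schema is assumed, not official.

# Image Difference Grounding: locate the region that changed across timepoints.
difference_item = {
    "task": "image_difference_grounding",
    "images": ["ct_baseline.png", "ct_followup.png"],  # hypothetical paths
    "question": (
        "Which region in the follow-up scan shows a new lesion compared "
        "with the baseline scan? Answer with a bounding box."
    ),
    "answer_box": [112, 84, 176, 140],  # assumed [x1, y1, x2, y2] pixel coords
}

# Image Consistency Grounding: locate the same semantics shared across the sequence.
consistency_item = {
    "task": "image_consistency_grounding",
    "images": ["mri_t1.png", "mri_t2.png"],  # hypothetical cross-modal pair
    "question": "Locate the lesion visible in both sequences in the second image.",
    "answer_box": [60, 50, 120, 110],
}

def box_valid(box):
    """Basic sanity check for an assumed [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2

print(box_valid(difference_item["answer_box"]))  # True
```

Both paradigms share the same answer format (a box on a target image); they differ only in whether the question asks for what changed or what persists across the sequence.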