🤖 AI Summary
Existing medical visual grounding benchmarks are confined to single-image settings, failing to support clinically critical tasks such as longitudinal and cross-modal lesion tracking and progression analysis, which require fine-grained semantic alignment and context-aware reasoning. To address this gap, we introduce MedSG-Bench, the first visual grounding benchmark dedicated to medical image sequences. It comprises eight VQA-style tasks organized into two paradigms: Image Difference Grounding (detecting change regions across images) and Image Consistency Grounding (localizing consistent or shared semantics across a sequence). The benchmark spans 10 imaging modalities and 76 public datasets, totaling 9,630 question-answer pairs. Evaluating both general-purpose and medical-domain MLLMs on it reveals that even advanced models exhibit substantial limitations on sequential grounding. To advance the field, we additionally release MedSG-188K, a large-scale instruction-tuning dataset for sequential visual grounding, and MedSeq-Grounder, an MLLM specialized for fine-grained understanding across medical image sequences. The benchmark, dataset, and model are all publicly released.
📝 Abstract
Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in the medical imaging domain. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks organized into two grounding paradigms: 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detecting consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench.
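To make the two paradigms concrete, here is a minimal sketch of what a VQA-style QA pair for each might look like. All field names, file paths, and the `[x1, y1, x2, y2]` box convention are illustrative assumptions, not the released MedSG-Bench schema.

```python
# Hypothetical sketch of MedSG-Bench-style QA pairs; schema is assumed, not official.

# Image Difference Grounding: locate the region that changed across timepoints.
difference_item = {
    "task": "image_difference_grounding",
    "images": ["ct_baseline.png", "ct_followup.png"],  # hypothetical paths
    "question": (
        "Which region in the follow-up scan shows a new lesion compared "
        "with the baseline scan? Answer with a bounding box."
    ),
    "answer_box": [112, 84, 176, 140],  # assumed [x1, y1, x2, y2] pixel coords
}

# Image Consistency Grounding: locate the same semantics shared across the sequence.
consistency_item = {
    "task": "image_consistency_grounding",
    "images": ["mri_t1.png", "mri_t2.png"],  # hypothetical cross-modal pair
    "question": "Locate the lesion visible in both sequences in the second image.",
    "answer_box": [60, 50, 120, 110],
}

def box_valid(box):
    """Basic sanity check for an assumed [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2

print(box_valid(difference_item["answer_box"]))  # True
```

Both paradigms share the same answer format (a box on a target image); they differ only in whether the question asks for what changed or what persists across the sequence.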