🤖 AI Summary
Existing large vision-language models (LVLMs) for healthcare perform well on single-image question answering but struggle with multi-image clinical tasks that require temporal reasoning, cross-image comparison, and co-reference resolution. Method: We introduce Med-MIM, an instruction-tuning dataset for medical multi-image understanding comprising 83.2K QA pairs that span four multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using Med-MIM, we fine-tune two base architectures to obtain specialized models, MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Contribution/Results: On the newly established Med-MIM benchmark, both models achieve superior performance on the held-in and held-out subsets, outperforming six other popular LVLMs evaluated alongside them. This work addresses a critical gap in medical LVLMs by providing a dedicated instruction dataset, benchmark, and models for robust multi-image medical understanding.
📝 Abstract
Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability to process multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four multi-image visual abilities: temporal understanding, reasoning, comparison, and co-reference. Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical LVLMs, MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.