Medical Large Vision Language Models with Multi-Image Visual Ability

📅 2025-05-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing large vision-language models (LVLMs) for healthcare perform well on single-image question answering but struggle with multi-image clinical tasks that require temporal reasoning, cross-image comparison, and co-reference resolution. Method: We introduce Med-MIM, the first instruction-tuning dataset for medical multi-image understanding, comprising 83.2K QA samples that span four defined multi-image visual abilities: temporal understanding, reasoning, comparison, and co-reference. Leveraging Med-MIM, we fine-tune two specialized models, MIM-LLaVA-Med and Med-Mantis, by instruction-tuning the LLaVA-Med and Mantis base architectures for multi-image analysis. Contribution/Results: On the newly established Med-MIM benchmark, both models achieve superior performance across the held-in and held-out subsets, outperforming the six other popular LVLMs evaluated (eight models assessed in total, including the two proposed here). This work bridges a critical gap in medical LVLMs by enabling robust multi-image clinical reasoning, contributing a dedicated instruction dataset, benchmark, and models for medical multi-image understanding.
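
To make the dataset's structure concrete, here is a minimal hypothetical sketch of what one Med-MIM-style multi-image instruction sample could look like. All field names, file paths, and text content are illustrative assumptions; the paper's actual schema is not given in this summary.

```python
# Hypothetical sketch of a Med-MIM-style multi-image instruction sample.
# Field names, file paths, and text are illustrative assumptions only.
sample = {
    "ability": "temporal_understanding",  # one of the four defined abilities
    "images": [
        "chest_xray_2023_01_10.png",  # baseline scan (hypothetical path)
        "chest_xray_2023_04_02.png",  # follow-up scan (hypothetical path)
    ],
    "question": (
        "Comparing the two chest X-rays taken three months apart, "
        "has the right-lower-lobe opacity progressed or resolved?"
    ),
    "answer": "The opacity has largely resolved on the follow-up image.",
}

# The four abilities named in the paper, normalized to identifiers here.
ABILITIES = {"temporal_understanding", "reasoning", "comparison", "co_reference"}
assert sample["ability"] in ABILITIES and len(sample["images"]) >= 2
```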

📝 Abstract
Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.
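
LLaVA-family models (including LLaVA-Med and Mantis) consume images through placeholder tokens interleaved with the text prompt. The library-free Python sketch below shows one plausible way a multi-image question could be serialized; the `<image>` token matches the LLaVA-style convention, but the authors' exact prompt template is an assumption here.

```python
IMAGE_TOKEN = "<image>"  # placeholder convention used by LLaVA-family models

def build_multi_image_prompt(question: str, num_images: int) -> str:
    """Interleave one image placeholder per input image before the question.

    A model's preprocessing later splices visual embeddings in at each
    placeholder position; this particular template is an assumption.
    """
    if num_images < 1:
        raise ValueError("at least one image is required")
    image_refs = " ".join(f"Image {i + 1}: {IMAGE_TOKEN}" for i in range(num_images))
    return f"{image_refs}\nQuestion: {question}\nAnswer:"

print(build_multi_image_prompt("Do these two MRI slices show the same lesion?", 2))
```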
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-image analysis in medical LVLMs
Addressing gaps in temporal and cross-modal reasoning
Developing datasets and benchmarks for medical multi-image QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the Med-MIM multi-image instruction dataset (83.2K QA pairs)
Fine-tuned Mantis and LLaVA-Med into Med-Mantis and MIM-LLaVA-Med
Created the Med-MIM benchmark for held-in/held-out evaluation (see the illustrative sketch below)
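
The summary does not include the authors' evaluation code; the sketch below is a generic, hypothetical illustration of how per-ability accuracy on held-in and held-out subsets could be tallied. The item schema, split names, and exact-match metric are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

def evaluate(
    model: Callable[[dict], str],
    benchmark: Iterable[dict],
) -> Dict[Tuple[str, str], float]:
    """Exact-match accuracy per (split, ability) pair.

    Each benchmark item is assumed to carry 'split' ('held_in'/'held_out'),
    'ability', and a gold 'answer'; this mirrors, but is not, the paper's
    actual protocol.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in benchmark:
        key = (item["split"], item["ability"])
        prediction = model(item).strip().lower()
        correct[key] += prediction == item["answer"].strip().lower()
        total[key] += 1
    return {k: correct[k] / total[k] for k in total}

# Toy usage with a trivial model stub:
stub = lambda item: "yes"
bench = [
    {"split": "held_in", "ability": "comparison", "answer": "yes"},
    {"split": "held_out", "ability": "reasoning", "answer": "no"},
]
print(evaluate(stub, bench))
# {('held_in', 'comparison'): 1.0, ('held_out', 'reasoning'): 0.0}
```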
Xikai Yang
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Juzheng Miao
PhD student, The Chinese University of Hong Kong
Research interests: medical image analysis, label-efficient learning, reinforcement learning, causality
Yuchen Yuan
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Jiaze Wang
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Qi Dou
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Hong Kong, China
Jinpeng Li
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Pheng-Ann Heng
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Hong Kong, China