OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

📅 2025-04-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current medical vision-language models (Med-VLMs) rely on separate, modality-specific encoders, which hinders unified understanding of text alongside 2D/3D medical images and videos and limits generalization. To address this, we propose OmniV-Med, the first unified architecture for joint multimodal medical understanding across 2D, 3D, and video data. The method introduces a rotary position-adaptive encoder for cross-modal, cross-resolution modeling and a medical-aware token pruning mechanism that reduces visual tokens by 60% while preserving performance. We further construct OmniV-Med-Instruct, a large-scale multimodal instruction dataset, and use it for multimodal instruction tuning. OmniV-Med-7B achieves state-of-the-art results on seven diverse 2D/3D/video medical benchmarks, and the lightweight OmniV-Med-1.5B variant can be trained on just eight RTX 3090 GPUs while enabling efficient long-video inference.


πŸ“ Abstract
The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.
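The abstract's rotary position-adaptive encoder handles 2D images, 3D volumes, and videos in one architecture. The paper does not give the exact formulation here, but the general idea of multi-axis rotary position embedding can be sketched as below: split the feature dimension across the input's axes (e.g. slice/height/width for a CT volume) and apply a 1D rotary rotation per axis, so the same encoder accepts 2- or 3-coordinate positions. Function names and the even dimension split are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def axis_rope(x, pos, base=10000.0):
    # Standard 1D rotary embedding: rotate feature pairs (x1[i], x2[i])
    # by an angle proportional to the position along this axis.
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

def multi_axis_rope(x, coords):
    # Hypothetical multi-axis extension: split the feature dim evenly
    # across axes (e.g. (h, w) for 2D, (slice, h, w) for 3D/video) and
    # rotate each chunk by its own coordinate.
    n_axes = len(coords)
    d = x.shape[0] // n_axes
    parts = [axis_rope(x[i * d:(i + 1) * d], coords[i]) for i in range(n_axes)]
    return np.concatenate(parts)
```

Because each chunk is only rotated, the token's norm is preserved, and a 2D image is just the 3D case with the slice coordinate fixed at zero.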
Problem

Research questions and friction points this paper is trying to address.

Integrating diverse medical visual modalities with text using a unified model
Overcoming limitations of separate encoders for different medical image types
Reducing computational costs while maintaining performance in medical vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified rotary position-adaptive encoder for 2D/3D images and videos
Medical-aware token pruning reduces visual tokens by 60%
Comprehensive multimodal dataset with 252K instructional samples
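The token-pruning innovation above exploits redundancy between consecutive CT slices or video frames. A minimal sketch of one plausible criterion, cosine similarity to the token at the same spatial location in the previous slice, is shown below; the threshold value and the per-location comparison are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def prune_redundant_tokens(slices, threshold=0.95):
    # slices: list of (num_tokens, dim) arrays, one per CT slice or video frame.
    # Hypothetical sketch: drop a token when it is nearly identical (by cosine
    # similarity) to the token at the same position in the previous slice.
    out = [slices[0]]          # keep the first slice in full
    prev = slices[0]
    for cur in slices[1:]:
        num = (prev * cur).sum(axis=1)
        denom = np.linalg.norm(prev, axis=1) * np.linalg.norm(cur, axis=1) + 1e-8
        keep = (num / denom) < threshold   # per-token cosine similarity
        out.append(cur[keep])
        prev = cur                         # compare against the raw previous slice
    return out
```

On volumetric scans where adjacent slices change slowly, most tokens fall above the threshold and are dropped, which is how a fixed budget like a 60% reduction becomes feasible without discarding anatomically novel regions.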