🤖 AI Summary
Current MRI clinical workflows are fragmented, existing models generalize poorly, and deep integration of medical imaging with clinical language remains lacking. To address these challenges, we propose the first unified vision-language foundation model spanning the entire MRI workflow—encompassing reconstruction, segmentation, lesion detection, diagnosis, and radiology report generation. Our method employs a four-stage training paradigm—self-supervised visual pretraining, cross-modal alignment, multimodal pretraining, and multi-task instruction tuning—leveraging over 220,000 MRI volumes, 19 million slices, and extensive clinical text instructions. The model jointly handles anatomical and pathological tasks, enabling end-to-end multimodal inference. Experiments demonstrate strong generalization across diverse anatomical regions and multi-center settings, significantly enhancing the integration, versatility, and clinical interpretability of MRI analysis.
📝 Abstract
Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows encompassing acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. While deep learning has achieved progress on individual tasks, existing approaches are often anatomy- or application-specific and lack generalizability across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with the complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets and comprising over 220,000 MRI volumes and 19 million MRI slices, incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results demonstrate OmniMRI's ability to perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. These findings highlight OmniMRI's potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.
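The staged curriculum above can be sketched as a simple training schedule. This is a minimal illustrative sketch, not the authors' code: the stage names and data categories come from the abstract, while the mapping of each stage to a data subset (and every identifier below) is an assumption based on common vision-language training practice.

```python
# Hypothetical sketch of OmniMRI's four-stage training schedule.
# Stage names and corpus categories are taken from the abstract; the
# stage-to-data mapping is an assumed, illustrative pairing.
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str   # training stage, as named in the abstract
    data: str   # corpus subset the stage is assumed to consume


def build_schedule() -> List[Stage]:
    """Return the four training stages in curriculum order."""
    return [
        Stage("self-supervised vision pretraining", "image-only MRI data"),
        Stage("vision-language alignment", "paired vision-text data"),
        Stage("multimodal pretraining", "paired vision-text data"),
        Stage("multi-task instruction tuning", "instruction-response data"),
    ]


if __name__ == "__main__":
    # Print the curriculum in order, one stage per line.
    for i, stage in enumerate(build_schedule(), start=1):
        print(f"Stage {i}: {stage.name} ({stage.data})")
```

The point of the sketch is the progression: each stage reuses the representations learned by the previous one, moving from generic visual features toward instruction-following behavior.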