Multi-modal Vision Pre-training for Medical Image Analysis

📅 2024-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised learning methods focus primarily on single-modality images and struggle to capture the intrinsic cross-modal correlations in multimodal medical imaging, such as multiparametric MRI (mpMRI). To address this, we propose the first large-scale self-supervised pre-training framework designed specifically for brain mpMRI, leveraging 2.4 million images from 16,022 scans of 3,755 patients. Our method introduces three collaborative proxy tasks: (1) cross-modal image reconstruction, (2) modality-aware contrastive learning, and (3) modality template distillation, systematically modeling structured inter-modal relationships. It integrates Transformer-based architectures, multimodal feature alignment, and knowledge distillation. Evaluated on six segmentation tasks, our approach improves Dice scores by 0.28–14.47%; on four classification tasks, it boosts accuracy by 0.65–18.07%, significantly outperforming existing pre-training paradigms.
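The second proxy task, modality-aware contrastive learning, can be illustrated with a minimal InfoNCE-style sketch: embeddings of two different MRI modalities from the same scan are treated as a positive pair, while other scans in the batch serve as negatives. This is a hedged illustration of the general technique, not the paper's exact loss; the function name, temperature value, and toy embeddings below are assumptions for demonstration.

```python
import numpy as np

def modality_contrastive_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss pairing embeddings of two MRI modalities
    (e.g. T1 and T2) from the same scan as positives.

    z_a, z_b: (N, D) embedding batches; row i of each comes from
    the same patient/scan. Illustrative sketch, not the paper's loss.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise softmax cross-entropy with the diagonal as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z_t1 = rng.normal(size=(8, 32))
loss_random = modality_contrastive_loss(z_t1, rng.normal(size=(8, 32)))
loss_aligned = modality_contrastive_loss(z_t1, z_t1 + 0.01 * rng.normal(size=(8, 32)))
# Cross-modal embeddings that agree per scan score a lower loss than random pairs
assert loss_aligned < loss_random
```

In pre-training, minimizing such a loss pulls representations of the same anatomy across modalities together, which is one way the framework can exploit the natural grouping of mpMRI scans.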

📝 Abstract
Self-supervised learning has greatly facilitated medical image analysis by reducing the amount of training data required for real-world applications. Current paradigms predominantly rely on self-supervision within uni-modal image data, thereby neglecting the inter-modal correlations essential for effective learning of cross-modal image representations. This limitation is particularly significant for naturally grouped multi-modal data, e.g., multi-parametric MRI scans for a patient undergoing various functional imaging protocols in the same study. To bridge this gap, we conduct a novel multi-modal image pre-training with three proxy tasks to facilitate the learning of cross-modality representations and correlations using multi-modal brain MRI scans (over 2.4 million images in 16,022 scans of 3,755 patients), i.e., cross-modal image reconstruction, modality-aware contrastive learning, and modality template distillation. To demonstrate the generalizability of our pre-trained model, we conduct extensive experiments on various benchmarks with ten downstream tasks. Our method outperforms state-of-the-art pre-training methods, with Dice score improvements of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy boost of 0.65%-18.07% in four individual image classification tasks.
Problem

Research questions and friction points this paper is trying to address.

Uni-modal self-supervised pre-training neglects the inter-modal correlations present in naturally grouped data such as mpMRI.
How to design proxy tasks that explicitly model cross-modal structure during pre-training.
Whether such pre-training generalizes across diverse downstream segmentation and classification tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal image pre-training with proxy tasks
Cross-modality representation learning via MRI scans
Enhanced segmentation and classification accuracy benchmarks
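The first proxy task, cross-modal image reconstruction, implies an input-corruption step in which information from one modality is withheld so the model must recover it from the others. A minimal sketch of such a corruption utility is shown below; the function name, channel layout, and modality ordering are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mask_one_modality(volume, rng):
    """Hypothetical corruption step for cross-modal reconstruction:
    zero out one randomly chosen modality channel so the network must
    reconstruct it from the remaining modalities.

    volume: (M, H, W) array with one channel per mpMRI modality.
    Returns the corrupted copy and the index of the masked modality.
    """
    m = rng.integers(volume.shape[0])
    corrupted = volume.copy()
    corrupted[m] = 0.0
    return corrupted, m

rng = np.random.default_rng(1)
vol = rng.normal(size=(4, 16, 16))   # e.g. T1, T1ce, T2, FLAIR slices
corrupted, m = mask_one_modality(vol, rng)
assert np.all(corrupted[m] == 0.0)
# The remaining modalities are left untouched
assert all(np.array_equal(corrupted[i], vol[i]) for i in range(4) if i != m)
```

A reconstruction head trained on such inputs is then penalized (e.g. with an L1 or L2 loss) against the original masked channel, encouraging the encoder to learn how modalities predict one another.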
👥 Authors

Shaohao Rui
PhD student, SJTU & SHAI Lab & SII
World Models, Video Gen, VLM, LLM

Lingzhi Chen
Unknown affiliation

Zhenyu Tang
Shanghai AI Laboratory, Shanghai 200030, China; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Lilong Wang
Shanghai AI Laboratory, Shanghai 200030, China

Mianxin Liu
Shanghai AI Laboratory, Shanghai 200030, China

Shaoting Zhang
Shanghai AI Lab; SenseTime Research
Medical Image Analysis, Computer Vision, Foundation Models

Xiaosong Wang
Shanghai AI Laboratory
Medical Image Analysis, Computer Vision, Vision and Language