UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical AI models suffer from a fundamental disconnect between visual understanding and generation: vision-language models excel at interpretation but cannot synthesize visual content, while generative models lack interpretability and textual reasoning. This split impedes coherent multimodal representation learning and cross-task synergy. To address it, the authors propose UniMedVL, a unified medical vision-language model built on a hierarchical "Observation-Knowledge-Analysis" architecture. For the first time, it jointly models medical image understanding (e.g., radiology report generation) and generation (e.g., lesion segmentation masks, synthetic pathology images) within a single framework. Leveraging a large-scale 5.6-million-sample multimodal dataset (UniMed-5M) and a progressive curriculum learning strategy, UniMedVL enables bidirectional knowledge transfer across tasks. It achieves state-of-the-art performance on five understanding benchmarks and matches specialized models in generation quality across eight medical imaging modalities, demonstrating the efficacy and generalizability of unified multimodal modeling in medical AI.

📝 Abstract
Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
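The abstract's three levels suggest a natural staged pipeline: observation reformats raw unimodal records into multimodal pairs, knowledge is injected through a training curriculum, and analysis routes every task through one shared model. The Python sketch below illustrates that flow only; every name in it (`MultimodalPair`, `observe`, `analyze`, the toy task heads) is a hypothetical stand-in for illustration, not the interface of the released code at https://github.com/uni-medical/UniMedVL.

```python
"""Minimal sketch of the Observation-Knowledge-Analysis (OKA) flow described
in the abstract. All identifiers are illustrative assumptions, not the
project's actual API."""
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MultimodalPair:
    image_path: str  # medical image (X-ray, CT, pathology, ...)
    text: str        # paired report, caption, or instruction


def observe(raw_records: List[dict]) -> List[MultimodalPair]:
    """Observation level: reformat diverse unimodal records into image-text
    pairs, in the spirit of how UniMed-5M builds its 5.6M samples."""
    return [MultimodalPair(r["image_path"], r.get("report", "")) for r in raw_records]


def analyze(pair: MultimodalPair, task: str, heads: Dict[str, Callable]) -> str:
    """Analysis level: a single entry point serves both understanding tasks
    (report generation, VQA) and generation tasks (masks, image synthesis)."""
    return heads[task](pair)


# Toy "heads" standing in for a shared unified backbone.
heads = {
    "report": lambda p: f"Findings for {p.image_path}: ...",
    "segmentation": lambda p: f"mask://{p.image_path}",
}

pairs = observe([{"image_path": "cxr_001.png", "report": "No acute findings."}])
print(analyze(pairs[0], "report", heads))        # understanding output (text)
print(analyze(pairs[0], "segmentation", heads))  # generation output (mask)
```

The point of routing both task families through one `analyze` entry is the paper's central claim: sharing a backbone lets generation tasks feed visual features back into understanding.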
Problem

Research questions and friction points this paper is trying to address.

Unifying medical multimodal understanding and generation tasks
Bridging gaps between separate medical AI interpretation and synthesis systems
Enabling bidirectional knowledge sharing across medical vision-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal model for medical understanding and generation
Progressive curriculum learning for medical knowledge integration (see the sketch after this list)
Observation-Knowledge-Analysis framework enabling bidirectional knowledge sharing
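Progressive curriculum learning, noted in the second bullet, introduces medical multimodal knowledge in stages rather than all at once. A minimal sketch of what such a staged data schedule could look like follows; the stage names, ordering, and sampling weights are assumptions for illustration, not the schedule published in the paper.

```python
"""Hypothetical sketch of a progressive curriculum schedule: each stage
mixes task families with different sampling weights, moving from
foundational alignment toward mixed understanding + generation."""
from typing import Dict, List, Tuple

CURRICULUM: List[Tuple[str, Dict[str, float]]] = [
    ("alignment",     {"image_text_pairs": 1.0}),
    ("understanding", {"image_text_pairs": 0.3, "vqa": 0.4, "report_generation": 0.3}),
    ("unified",       {"vqa": 0.25, "report_generation": 0.25,
                       "segmentation": 0.25, "image_synthesis": 0.25}),
]


def batches_per_task(stage_weights: Dict[str, float], total_batches: int) -> Dict[str, int]:
    """Turn a stage's sampling weights into a per-task batch budget."""
    return {task: round(weight * total_batches) for task, weight in stage_weights.items()}


for stage_name, weights in CURRICULUM:
    print(stage_name, batches_per_task(weights, total_batches=1000))
```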
👥 Authors
Junzhi Ning
Shanghai Artificial Intelligence Laboratory
Wei Li
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Cheng Tang
Shanghai Artificial Intelligence Laboratory, Shanghai Institute of Optics and Fine Mechanics
Jiashi Lin
Shanghai Artificial Intelligence Laboratory
Chenglong Ma
Fudan University; Shanghai Innovation Institute
multi-modal models, generative models, medical image analysis
Chaoyang Zhang
Shanghai Innovation Institute
Jiyao Liu
Shanghai Artificial Intelligence Laboratory, Fudan University
Ying Chen
Shanghai Artificial Intelligence Laboratory
Shujian Gao
Shanghai Artificial Intelligence Laboratory, Fudan University
Lihao Liu
Amazon
LLM-based Agent, Healthcare AI
Yuandong Pu
SJTU, Shanghai AI Laboratory
Computer Vision
Huihui Xu
Shanghai Artificial Intelligence Laboratory, The Hong Kong University of Science and Technology
Chenhui Gou
3rd-year PhD candidate, Monash University
LLM, Multimodality
Ziyan Huang
Shanghai Artificial Intelligence Laboratory
Yi Xin
California Institute of Technology
Industrial Organization, Econometrics
Qi Qin
Shanghai Artificial Intelligence Laboratory
Zhongying Deng
University of Cambridge
Deep Learning, Multi-modal Learning, Computer Vision, Medical Image Analysis
Diping Song
Shanghai Artificial Intelligence Laboratory
Bin Fu
Shanghai Artificial Intelligence Laboratory
Guang Yang
Imperial College London
Yuanfeng Ji
Stanford; HKU
Computer Vision, Medical Image Analysis
Tianbin Li
Shanghai Artificial Intelligence Laboratory
Machine Learning, Computer Vision, General Intelligence
Yanzhou Su
FZU, UESTC
medical image analysis
Jin Ye
Shanghai Artificial Intelligence Laboratory, Monash University
Shixiang Tang
Shanghai Artificial Intelligence Laboratory