MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

📅 2025-01-27
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Existing fundus image-text pretraining methods rely heavily on private clinical data and pay little attention to how knowledge is fused during pretraining. To address this, we introduce MM-Retinal V2, a high-quality image-text paired dataset covering three fundus modalities: color fundus photography (CFP), fundus fluorescein angiography (FFA), and optical coherence tomography (OCT). We also propose KeepFIT V2, a fundus vision-language pretraining model that transfers knowledge from this elite dataset (the "knowledge spark") into public categorical datasets. It does so in two parts: a preliminary textual pretraining stage gives the text encoder ophthalmic knowledge, and a Hybrid Knowledge Injection module combines contrastive learning for global semantic alignment with generative learning for fine-grained local detail. Evaluated under zero-shot, few-shot, and linear-probe settings, KeepFIT V2 outperforms existing open-source methods and is competitive with state-of-the-art models trained on large-scale private datasets. Both the MM-Retinal V2 dataset and the KeepFIT V2 model are publicly released.

📝 Abstract
Vision-language pretraining (VLP) has been investigated as a way to generalize across diverse downstream tasks in fundus image analysis. Although recent methods show promising results, they rely heavily on large-scale private image-text data and pay less attention to how pretraining is performed, which limits further progress. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. We then propose a novel fundus vision-language pretraining model, KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining stage equips the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer; it combines global semantic concepts from contrastive learning with local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, which delivers performance competitive with state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available at https://github.com/lxirich/MM-Retinal.
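The abstract does not give the loss formulas, so the following is only a minimal sketch of the general idea behind combining a contrastive objective with a generative one. It assumes a generic CLIP-style symmetric contrastive loss over paired image and text embeddings and a simple weighted sum with a separately computed generative loss. The function names, the temperature of 0.07, and the `weight` parameter are illustrative assumptions and are not taken from KeepFIT V2.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Assumption: image_embs[i] and text_embs[i] form a matched pair.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


def hybrid_loss(contrastive, generative, weight=1.0):
    """Hypothetical combination: contrastive term plus a weighted generative term."""
    return contrastive + weight * generative


if __name__ == "__main__":
    image_embs = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", contrastive_loss(image_embs, matched_texts))
    print("swapped pairs:", contrastive_loss(image_embs, swapped_texts))
```

In this toy example the loss is close to zero when each image embedding lines up with its own text embedding and large when the pairs are swapped, which is the global alignment behaviour the contrastive term is meant to encourage. The generative term, which the abstract associates with local appearance details, would be computed by a separate decoder and is left as an input here.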
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Pretraining
Fundus Photography
Modality Fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLP Technology
Ophthalmic Image Analysis
MM-Retinal V2 Dataset
Authors
Ruiqi Wu
School of Computer Science and Engineering, Southeast University, Nanjing, China
Na Su
Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Chenran Zhang
School of Computer Science and Engineering, Southeast University, Nanjing, China
Tengfei Ma
Stony Brook University
Natural Language Processing, Machine Learning, Healthcare, Graph Neural Networks
Tao Zhou
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Zhiting Cui
Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Nianfeng Tang
Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Tianyu Mao
Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Yi Zhou
School of Computer Science and Engineering, Southeast University, Nanjing, China
Wen Fan
University of California, Berkeley
Nanotechnology, Vanadium dioxide, 2D materials
Tianxing Wu
Ph.D. Student, Nanyang Technological University
Computer Vision
Shenqi Jing
Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
Huazhu Fu
Principal Scientist, IHPC, A*STAR
Medical Image Analysis, AI for Healthcare, Medical AI, Trustworthy AI