MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

📅 2025-01-27

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Existing fundus image–text pretraining methods heavily rely on private clinical data and suffer from insufficient multimodal knowledge fusion. To address this, we introduce MM-Retinal V2—the first high-quality, multimodal (CFP/FFA/OCT) fundus image–text paired dataset—and propose KeepFIT V2, a novel vision-language foundation model. KeepFIT V2 pioneers the “Knowledge Spark” transfer paradigm, integrating text-aware pretraining with a Hybrid Knowledge Injection module that jointly leverages contrastive learning for global semantic alignment and generative learning for fine-grained local detail capture. Crucially, it enables efficient transfer of scarce clinical knowledge to publicly available data. Evaluated under zero-shot, few-shot, and linear-probe settings, KeepFIT V2 trained exclusively on public data consistently outperforms all existing open-source methods and matches the performance of state-of-the-art models trained on large-scale private datasets. Both the MM-Retinal V2 dataset and KeepFIT V2 code are publicly released.

Technology Category

Application Category

📝 Abstract

Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they significantly rely on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancements. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, which is essentially based on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, delivering performance competitive to state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.

Problem

Research questions and friction points this paper is trying to address.

Visual Language Pre-training

Fundus Photography

Modality Fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLP Technology

Ophthalmic Image Analysis

MM-Retinal V2 Dataset

🔎 Similar Papers

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge