🤖 AI Summary
Existing fundus image–text pretraining methods heavily rely on private clinical data and suffer from insufficient multimodal knowledge fusion. To address this, we introduce MM-Retinal V2—the first high-quality, multimodal (CFP/FFA/OCT) fundus image–text paired dataset—and propose KeepFIT V2, a novel vision-language foundation model. KeepFIT V2 pioneers the “Knowledge Spark” transfer paradigm, integrating text-aware pretraining with a Hybrid Knowledge Injection module that jointly leverages contrastive learning for global semantic alignment and generative learning for fine-grained local detail capture. Crucially, it enables efficient transfer of scarce clinical knowledge to publicly available data. Evaluated under zero-shot, few-shot, and linear-probe settings, KeepFIT V2 trained exclusively on public data consistently outperforms all existing open-source methods and matches the performance of state-of-the-art models trained on large-scale private datasets. Both the MM-Retinal V2 dataset and KeepFIT V2 code are publicly released.
📝 Abstract
Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they significantly rely on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancements. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, which is essentially based on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, delivering performance competitive to state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.