AI Summary
Existing general-purpose and medical multimodal large language models exhibit limited performance on ophthalmology-specific tasks, and open-source, domain-specialized alternatives are scarce. To address this gap, this work proposes VOLMO, an architecture-agnostic, open-data framework for building ophthalmology-specific multimodal large language models, and uses it to train VOLMO-2B, a compact 2B-parameter model. The framework introduces a novel three-stage training strategy: knowledge-rich pretraining on large-scale medical image-text pairs from the literature, multitask fine-tuning on multi-disease annotated data, and clinical chain-of-thought refinement using real-world case reports. VOLMO-2B achieves an average F1 score of 87.4% across 12 eye diseases, significantly outperforms strong baselines in image captioning and clinical recommendation generation, and demonstrates robust generalization on three independent external cohorts for age-related macular degeneration and diabetic retinopathy.
Abstract
Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, a process that is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, open-data framework for developing ophthalmology-specific MLLMs. VOLMO comprises three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.