🤖 AI Summary
Existing multimodal large language models (MLLMs) lack systematic evaluation for fundus image understanding: current benchmarks offer insufficient task granularity and fail to decouple the vision encoder (VE) and large language model (LLM) components. Method: We introduce FunBench, the first hierarchical visual question answering benchmark tailored to fundus image interpretation, spanning four levels: modality perception, anatomy perception, lesion analysis, and disease diagnosis. FunBench pairs a clinically grounded task taxonomy with a modular evaluation framework comprising linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic end-to-end evaluation. Contribution/Results: Experiments on nine open-source MLLMs and GPT-4o reveal pervasive deficiencies in fundus interpretation; even basic tasks such as left/right eye discrimination fall below 60% accuracy, underscoring the need for domain-specific adaptation and more rigorous evaluation.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and do not support modular analysis of MLLMs' two key components, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs' fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on nine open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.
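For readers unfamiliar with the linear-probe protocol behind the VE evaluation mode, the sketch below illustrates the core idea: the encoder stays frozen and only a linear classifier is trained on its features, so the resulting accuracy reflects what the encoder's representations capture rather than what a trained head can learn. This is a minimal illustration, not FunBench's actual pipeline; the ImageNet-pretrained ResNet-18 backbone, the synthetic tensors, and the binary labels are placeholder assumptions (in practice the encoder would be the MLLM's vision tower and the loaders would serve fundus images labeled for a task such as laterality).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a dataloader; collect features and labels."""
    encoder.eval()
    feats, labels = [], []
    for images, ys in loader:
        feats.append(encoder(images.to(device)).flatten(1).cpu())
        labels.append(ys)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def linear_probe_accuracy(encoder, train_loader, test_loader, device="cpu"):
    """Fit a logistic-regression head on frozen features; report test accuracy."""
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))

# Placeholder backbone: an ImageNet-pretrained ResNet-18 with its final
# fully connected layer removed, standing in for an MLLM's vision tower.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-1])

# Synthetic stand-in data (random tensors, random binary labels); replace
# with real fundus image loaders and task labels (e.g., left/right eye).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,))
train_loader = DataLoader(TensorDataset(images[:48], labels[:48]), batch_size=16)
test_loader = DataLoader(TensorDataset(images[48:], labels[48:]), batch_size=16)

print(f"linear-probe accuracy: {linear_probe_accuracy(encoder, train_loader, test_loader):.3f}")
```

Because only the logistic-regression head is trained, a low probe accuracy on a basic task points to a representational gap in the VE itself, which is precisely what this evaluation mode is meant to isolate from the LLM.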