🤖 AI Summary
Existing multimodal large language models (MLLMs) lack systematic evaluation for fundus image understanding: current benchmarks offer insufficient task granularity and fail to decouple the vision encoder (VE) and large language model (LLM) components. Method: We introduce FunBench, the first hierarchical visual question answering benchmark tailored to fundus image interpretation, spanning four levels: modality perception, anatomy perception, lesion analysis, and disease diagnosis. FunBench pairs a clinically grounded task taxonomy with a modular evaluation framework comprising linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic end-to-end evaluation. Contribution/Results: Experiments on nine open-source MLLMs and GPT-4o reveal pervasive deficiencies in fundus interpretation; even basic tasks such as left/right eye discrimination fall below 60% accuracy, underscoring the need for domain-specific adaptation and more rigorous evaluation.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and do not support modular analysis of MLLMs' two key components, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs' fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on nine open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.
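For readers unfamiliar with the linear-probe protocol behind the VE evaluation mode, the sketch below illustrates the core idea: the encoder stays frozen and only a linear classifier is trained on its features, so the resulting accuracy reflects what the encoder's representations capture rather than what a trained head can learn. This is a minimal illustration, not FunBench's actual pipeline; the ImageNet-pretrained ResNet-18 backbone, the synthetic tensors, and the binary labels are placeholder assumptions (in practice the encoder would be the MLLM's vision tower and the loaders would serve fundus images labeled for a task such as laterality).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a dataloader; collect features and labels."""
    encoder.eval()
    feats, labels = [], []
    for images, ys in loader:
        feats.append(encoder(images.to(device)).flatten(1).cpu())
        labels.append(ys)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def linear_probe_accuracy(encoder, train_loader, test_loader, device="cpu"):
    """Fit a logistic-regression head on frozen features; report test accuracy."""
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))

# Placeholder backbone: an ImageNet-pretrained ResNet-18 with its final
# fully connected layer removed, standing in for an MLLM's vision tower.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-1])

# Synthetic stand-in data (random tensors, random binary labels); replace
# with real fundus image loaders and task labels (e.g., left/right eye).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,))
train_loader = DataLoader(TensorDataset(images[:48], labels[:48]), batch_size=16)
test_loader = DataLoader(TensorDataset(images[48:], labels[48:]), batch_size=16)

print(f"linear-probe accuracy: {linear_probe_accuracy(encoder, train_loader, test_loader):.3f}")
```

Because only the logistic-regression head is trained, a low probe accuracy on a basic task points to a representational gap in the VE itself, which is precisely what this evaluation mode is meant to isolate from the LLM.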