Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address fragmented annotation granularity and inconsistent clinical reasoning in ophthalmic multimodal large language models (MLLMs), this paper introduces FundusExpert, a domain-specific ophthalmic MLLM. Methodologically, it proposes: (i) a clinically aligned cognitive-chain reasoning framework that jointly models lesion localization and diagnostic logic for interpretable cross-modal understanding; (ii) FundusGen, a high-quality fundus image dataset whose systematic scaling analysis reveals a scaling law linking multimodal ophthalmic data quality to model performance; and (iii) an integrated pipeline combining intelligent lesion localization, MLLM-based semantic expansion, instruction tuning, and fine-grained visual feature analysis for end-to-end vision-language co-modeling. Experiments show that FundusExpert surpasses the average accuracy of the 40B MedRegA by 26.6% on ophthalmic question-answering tasks and attains 77.0% clinical consistency in zero-shot report generation, significantly outperforming GPT-4o (47.6%).
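FundusGen's exact record schema is not given here; purely as an illustration, a cognitive-chain instruction sample that couples region-level localization with staged diagnostic reasoning might look like the following Python dict (every field name, coordinate, and label below is hypothetical):

```python
# Hypothetical sketch of a FundusGen-style instruction record; the paper's
# schema is not published in this summary, so all fields here are assumed.
sample = {
    "image": "fundus_000123.jpg",
    "task": "positioning_diagnosis",
    # Region-level localization: lesion boxes as (x1, y1, x2, y2) pixel coords.
    "lesions": [
        {"label": "microaneurysm", "bbox": [412, 305, 438, 329]},
        {"label": "hard_exudate",  "bbox": [520, 280, 590, 340]},
    ],
    # Clinically aligned cognitive chain: localize -> characterize -> diagnose.
    "cognition_chain": [
        "Step 1: identify each lesion and its retinal location.",
        "Step 2: describe fine-grained features (size, margins, distribution).",
        "Step 3: integrate the findings into a graded diagnosis.",
    ],
    "diagnosis": "moderate non-proliferative diabetic retinopathy",
}
```

Instruction tuning would then serialize such records into image-prompt-response triples, with the cognition chain forming the model's target reasoning path.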

📝 Abstract
Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o's 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically aligned MLLM and explores a pathway toward bridging the visual-language gap in domain-specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
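As a worked illustration of how an exponent like the reported $L \propto N^{0.068}$ can be estimated, the sketch below fits a power law by ordinary least squares in log-log space, where the slope of the fitted line is the exponent; the (N, L) values are synthetic stand-ins, not measurements from the paper:

```python
import numpy as np

# Synthetic (N, L) pairs generated to follow L ∝ N^0.068 exactly, standing in
# for the paper's actual measurements. A power law L = c * N^a is linear after
# taking logs: log L = log c + a * log N, so a is the slope of a linear fit.
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])   # data scale (hypothetical values)
L = 1.0 * N ** 0.068                       # metric following the stated law

exponent, log_c = np.polyfit(np.log(N), np.log(L), 1)
print(f"fitted exponent: {exponent:.3f}")  # prints ~0.068
```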
Problem

Research questions and friction points this paper is trying to address.

Address fragmented annotation granularity in ophthalmic imaging
Resolve inconsistencies in clinical reasoning logic
Bridge the visual-language gap in medical MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated positioning-diagnosis reasoning in an ophthalmic MLLM
Automated multimodal data fusion via the Fundus-Engine system
Clinically aligned cognitive chain for interpretable reasoning (see the sketch after this list)
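As referenced in the last bullet, below is a loose sketch, not the authors' released implementation, of how a positioning-diagnosis cognitive chain could be staged over a generic MLLM; `mllm_generate` is a hypothetical stand-in for a single vision-language inference call, and the prompts are illustrative:

```python
from typing import Callable

def cognitive_chain(image_path: str,
                    mllm_generate: Callable[[str, str], str]) -> dict:
    """Stage the clinically aligned chain: localize -> characterize -> diagnose.

    `mllm_generate(image_path, prompt)` is a placeholder for any MLLM call;
    the staging and prompt wording here are assumptions, not the paper's own.
    """
    # Stage 1: region-level lesion localization.
    lesions = mllm_generate(image_path, "List each lesion with a bounding box.")
    # Stage 2: fine-grained feature analysis, conditioned on the localizations.
    features = mllm_generate(
        image_path, f"Describe the fine-grained features of: {lesions}")
    # Stage 3: diagnosis grounded in the localized evidence, with an
    # interpretable reasoning path.
    diagnosis = mllm_generate(
        image_path,
        f"Given lesions {lesions} and features {features}, state a diagnosis "
        "and the reasoning path that links the findings to it.")
    return {"lesions": lesions, "features": features, "diagnosis": diagnosis}
```

Keeping each stage conditioned on explicit, region-level evidence mirrors the paper's pairing of localization with diagnostic reasoning chains.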