Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe performance degradation of general-purpose vision-language models (e.g., CLIP) in ophthalmic cross-modal retrieval, this paper proposes a knowledge-enhanced multimodal Transformer framework tailored for intelligent diabetic retinopathy (DR) diagnosis. For the first time, structured clinical features are incorporated into the cross-modal joint embedding, and modality-specific encoders are combined with a multi-objective collaborative training strategy to bridge the domain gap between generic CLIP and medical cross-modal retrieval. The model integrates ViT-B/16, Bio-ClinicalBERT, and MLP encoders, jointly optimized via contrastive learning, image reconstruction loss, and DR-grade classification loss. On BRSET, it achieves 99.94% text-to-image Recall@1 (vs. 1.29% for fine-tuned CLIP), with grading accuracies of 97.97% and 97.05% on ICDR and SDRG, respectively; zero-shot transfer to DeepEyeNet yields 93.95% Recall@1.

📝 Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
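The contrastive objective between modality pairs described in the abstract is, in the usual CLIP-style formulation, a symmetric InfoNCE loss over the in-batch image-text similarity matrix. The paper does not publish its loss code, so the sketch below is an illustrative NumPy reconstruction of that standard formulation; the temperature value and function names are assumptions, not the authors' implementation, and the reconstruction and DR-grading losses the paper adds on top are only indicated in a comment.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style InfoNCE sketch).

    img_emb, txt_emb: (B, d) arrays where row i of each is a matching pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B); true pairs on the diagonal
    idx = np.arange(len(logits))

    def xent(l):
        # cross-entropy of each row against its diagonal (matching) entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the text->image and image->text retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))

# In the paper's multi-objective setup this term would be summed with
# reconstruction and ICDR/SDRG classification losses (weights unspecified).

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = info_nce(emb, emb)        # perfectly aligned pairs -> low loss
shuffled = info_nce(emb, emb[::-1]) # mismatched pairs -> high loss
```

A quick sanity check is that the loss for correctly aligned pairs is far smaller than for shuffled ones, which is exactly the signal that drives the retrieval Recall@1 numbers reported above.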
Problem

Research questions and friction points this paper is trying to address.

Enhances cross-modal alignment for diabetic retinopathy diagnosis
Improves medical image-text retrieval beyond general-domain CLIP models
Integrates multimodal data for accurate DR severity classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal transformer integrates retinal images, clinical text, and structured data
Separate encoders for each modality fused with joint transformer and multiple losses
Achieves near-perfect cross-modal retrieval and state-of-the-art classification accuracy
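The fusion step described above (separate encoders joined by a transformer with modality-specific embeddings) typically amounts to tagging each modality's token sequence with a learned modality-type vector before concatenation. The following NumPy sketch shows that bookkeeping only; the dimensions, random stand-ins for the encoder outputs, and variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 16  # shared embedding dimension (hypothetical; the real model uses the ViT/BERT width)

# Stand-ins for the three encoder outputs: ViT-B/16 patch tokens,
# Bio-ClinicalBERT text tokens, and the MLP-encoded structured features.
img_tokens = rng.normal(size=(197, d))  # 196 patches + [CLS]
txt_tokens = rng.normal(size=(64, d))   # clinical-narrative tokens
tab_token = rng.normal(size=(1, d))     # one vector of structured patient data

# One learned modality-type embedding per modality (initialized randomly here).
mod_emb = rng.normal(size=(3, d))

# Tag each token with its modality embedding, then concatenate into the
# single sequence the joint transformer attends over.
fused = np.concatenate([
    img_tokens + mod_emb[0],
    txt_tokens + mod_emb[1],
    tab_token + mod_emb[2],
], axis=0)
```

The modality embeddings let the joint transformer distinguish which modality a token came from after concatenation, analogous to position embeddings distinguishing token order.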
Argha Kamal Samanta
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, Kharagpur, India
Harshika Goyal
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, Kharagpur, India
Vasudha Joshi
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, Kharagpur, India
Tushar Mungle
Postdoctoral Scholar, Biomedical Informatics, Stanford University
Clinical Informatics, Electronic Health Records, Medical Image Analysis
Pabitra Mitra
Professor of Computer Science and Engineering, Indian Institute of Technology Kharagpur
Machine learning, pattern recognition, data mining, information retrieval