Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite growing interest in open-source vision-language models (VLMs) for medical imaging, their diagnostic capabilities across diverse clinical tasks remain systematically unassessed. Method: This study conducts the first comprehensive evaluation of five state-of-the-art open-source VLMs—Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1—across five challenging medical image analysis tasks: multi-label chest X-ray classification, colon pathology detection, endoscopic lesion identification, neonatal jaundice assessment, and five-grade diabetic retinopathy grading on fundus images. Experiments employ three input–reasoning configurations—unimodal vision-only, multimodal (image + text), and chain-of-thought reasoning—on the MedFMC benchmark, with statistical significance assessed via bootstrap confidence intervals. Results: Qwen2.5 achieves the highest overall performance (90.4% accuracy on chest X-rays; 84.2% on endoscopy), while Phi-4 ties for best on colon pathology and jaundice tasks. All models fail dramatically on retinopathy grading (max 18.6%), and neither multimodal input nor chain-of-thought reasoning yields statistically significant improvements. The study identifies critical capability boundaries and clinical deployment bottlenecks of current open-source VLMs in complex medical image understanding.

📝 Abstract
This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset, which comprises 22,349 images from 7,461 patients across chest radiography (multi-label classification of 19 diseases), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin-color-based determination of treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared across three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground-truth labels, with statistical comparisons based on bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding the other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93), significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracy at 18.6% (comparable, p=.99), significantly better than the other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought prompting likewise failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
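The paper's statistical comparisons rest on bootstrapped confidence intervals over per-sample correctness. The exact procedure is not spelled out on this page; the following is a minimal sketch of one common approach (percentile bootstrap for a single model's accuracy, and a paired bootstrap for comparing two models scored on the same samples) — the function names and the two-sided p-value construction are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CI for accuracy from a 0/1 correctness array."""
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    # Resample sample indices with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    accs = correct[idx].mean(axis=1)
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi

def paired_bootstrap_p(correct_a, correct_b, n_boot=10_000):
    """Two-sided paired bootstrap test for a difference in accuracy
    between two models evaluated on the same samples (illustrative)."""
    diff = np.asarray(correct_a, dtype=float) - np.asarray(correct_b, dtype=float)
    n = len(diff)
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = diff[idx].mean(axis=1)
    # Fraction of resampled mean differences on either side of zero.
    p = 2 * min((boot <= 0).mean(), (boot >= 0).mean())
    return diff.mean(), min(p, 1.0)
```

Pairing the resamples matters here: because both models are scored on the identical MedFMC test items, resampling the per-item differences (rather than each model's accuracy independently) accounts for the correlation between the two models' errors.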
Problem

Research questions and friction points this paper is trying to address.

Evaluating diagnostic accuracy of open-source vision-language models on medical imaging tasks
Comparing model performance across diverse medical imaging modalities and settings
Identifying limitations in complex domains like retinal fundoscopy for clinical use
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated five open-source vision-language models
Used MedFMC dataset with diverse medical images
Compared accuracy in three experimental settings
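The page does not reproduce the study's actual prompts, but the three experimental settings (vision-only, multimodal, chain-of-thought) can be pictured as variations on a shared question template. The wording, template names, and `build_prompt` helper below are all hypothetical, purely to illustrate how such a comparison might be structured:

```python
# Illustrative prompt templates for the three evaluation settings.
# All wording is hypothetical -- this page does not publish the study's prompts.
TASK_QUESTION = "Which of the following findings are present? {options}"

CONFIGS = {
    # Vision-only: the image plus a minimal answer request.
    "vision_only": "{question} Answer with the option letter(s) only.",
    # Multimodal: the image plus accompanying clinical context text.
    "multimodal": (
        "Clinical context: {context}\n"
        "{question} Answer with the option letter(s) only."
    ),
    # Chain-of-thought: ask the model to reason before answering.
    "cot": (
        "{question} Think step by step about the visual findings, "
        "then give the option letter(s)."
    ),
}

def build_prompt(setting, options, context=""):
    """Render the text part of a prompt; the image is sent alongside it."""
    question = TASK_QUESTION.format(options=options)
    return CONFIGS[setting].format(question=question, context=context)
```

Holding the question and answer format fixed across settings, and varying only the added context or reasoning instruction, is what allows the accuracy differences between the three settings to be attributed to the input-reasoning configuration itself.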
Gustav Müller-Franzes
RWTH Aachen University
Machine Learning · Artificial Intelligence · Medical Image Analysis
Debora Jutz
Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
Jakob Nikolas Kather
Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
Christiane Kuhl
Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
Sven Nebelung
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Advanced MRI Techniques · Functionality Assessment · Biomechanical Imaging · Cartilage · Artificial Intelligence
Daniel Truhn
Professor of Radiology, University Hospital Aachen
Machine Learning · Artificial Intelligence · Computer Vision · Medical Imaging