Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the capabilities of vision-language models (VLMs) in 3D clinical diagnosis of abdominal tumors, focusing on small-lesion detection, 3D anatomical reasoning, and integration of medical knowledge. To this end, the authors introduce DeepTumorVQA—the first tumor-centric 3D clinical visual question answering (VQA) benchmark—comprising 9,262 CT volumes and 395K expert-annotated questions, and formally define the tumor-centric 3D VQA task. The methodology features multi-granularity annotations, four diagnostic-oriented question categories (recognition, measurement, visual reasoning, and medical reasoning), and a 3D voxel-aware processing paradigm. Cross-model evaluation covers state-of-the-art VLMs including RadFM, M3D, Merlin, and CT-CHAT. Results reveal that while current models perform moderately on measurement tasks, they exhibit significant limitations in small-tumor recognition and clinical reasoning; RadFM achieves top performance, attributed to its large-scale multimodal pretraining. The DeepTumorVQA dataset is publicly released to advance rigorous evaluation of medical multimodal AI.

📝 Abstract
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find that current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, falling short of clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA test performance, making RadFM stand out among all VLMs; (2) our dataset exposes critical differences in VLM components, where proper image preprocessing and vision-module design significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether VLMs are ready for accurate 3D clinical tumor diagnosis
Assessing VLMs on small-tumor detection and 3D anatomical reasoning
Benchmarking VLM performance on medical visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepTumorVQA: a tumor-centric benchmark for 3D diagnostic VQA (9,262 CT volumes, 395K questions)
Evidence that large-scale multimodal pretraining is key to benchmark performance
Analysis showing that image preprocessing and vision-module design shape 3D perception
Authors

Yixiong Chen, Johns Hopkins University (Vision Language Models · Computer Vision · Medical Image Analysis)
Wenjie Xiao, Johns Hopkins University
P. R. Bassi, University of Bologna; Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia
Xinze Zhou, Johns Hopkins University
Sezgin Er, Istanbul Medipol University
Ibrahim Ethem Hamamci, MD-PhD Student at University of Zurich | ETH AI Center (Medical Image Analysis · Machine Learning)
Zongwei Zhou, Johns Hopkins University
Alan L. Yuille, Johns Hopkins University