🤖 AI Summary
This study evaluates the capabilities of vision-language models (VLMs) in 3D clinical diagnosis of abdominal tumors, focusing on small-lesion detection, 3D anatomical reasoning, and integration of medical knowledge. To this end, we introduce DeepTumorVQA, the first tumor-centric 3D clinical visual question answering (VQA) benchmark, comprising 9,262 CT volumes and 395K expert-annotated questions, and formally define the tumor-centric 3D VQA task. Our methodology features multi-granularity annotations, four diagnosis-oriented question categories (recognition, measurement, visual reasoning, and medical reasoning), and a 3D voxel-aware processing paradigm. We conduct a cross-model evaluation of state-of-the-art VLMs, including RadFM, M3D, Merlin, and CT-CHAT. Results reveal that while current models perform adequately on measurement tasks, they exhibit significant limitations in small-tumor recognition and clinical reasoning; RadFM achieves the top performance owing to its large-scale multimodal pretraining. The DeepTumorVQA dataset is publicly released to advance rigorous evaluation of medical multimodal AI.
📝 Abstract
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear given the stringent demands on recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA poses unique challenges, including small-tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find that current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and still fall short of clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining is crucial to performance on DeepTumorVQA, making RadFM stand out among the evaluated VLMs; (2) the benchmark exposes critical differences across VLM components, where image preprocessing and vision-module design significantly affect 3D perception. To facilitate medical multimodal research, we release DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.