UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

2026-05-16

AI Summary
This study addresses the lack of specialized evaluation benchmarks for vision-language models in neuro-oncology, particularly for interpreting multi-sequence, three-dimensional brain MRI scans, a gap that hinders their reliable deployment in real-world clinical settings. To bridge this, the authors introduce UCSF-PDGM-VQA, a clinically relevant visual question answering (VQA) benchmark for glioma MRI interpretation, built on the publicly available UCSF-PDGM dataset and comprising 2,387 question-answer pairs from 473 glioma-related MRI studies. The work establishes a performance baseline for six state-of-the-art vision-language models and one large language model, and reports modality collapse: the models suppress visual features, rely excessively on language priors, and fail to process multi-sequence 3D imaging effectively. These findings point to a critical deficiency in reliability and safety for clinical deployment and motivate the development of robust, domain-specific models.
Abstract
Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.
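One way to probe the reliance on language priors described in the abstract is to compare a model's answers with and without the imaging input. The sketch below is illustrative only and is not the authors' evaluation protocol: the `Prediction` record, the exact-match scoring, and the toy records are all assumptions introduced here, and none of the values come from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one QA pair; field names are assumptions."""
    question_id: str
    gold: str        # reference answer
    with_image: str  # model answer when the MRI input is provided
    text_only: str   # model answer when only the question text is provided


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def accuracy(preds, field):
    """Exact-match accuracy of the chosen answer field against the reference."""
    if not preds:
        raise ValueError("no predictions to score")
    correct = sum(normalize(getattr(p, field)) == normalize(p.gold) for p in preds)
    return correct / len(preds)


def image_gain(preds):
    """Accuracy with images minus accuracy without them; a value near zero suggests the images add little."""
    return accuracy(preds, "with_image") - accuracy(preds, "text_only")


def answer_agreement(preds):
    """Fraction of questions answered identically with and without the image."""
    if not preds:
        raise ValueError("no predictions to score")
    same = sum(normalize(p.with_image) == normalize(p.text_only) for p in preds)
    return same / len(preds)


# Invented toy records, for illustration only (not taken from UCSF-PDGM-VQA).
toy_predictions = [
    Prediction("q1", "yes", "Yes", "yes"),
    Prediction("q2", "left frontal lobe", "left frontal lobe", "right temporal lobe"),
    Prediction("q3", "no", "yes", "yes"),
    Prediction("q4", "grade 4", "grade 4", "grade 4"),
]

if __name__ == "__main__":
    print("accuracy with image:", accuracy(toy_predictions, "with_image"))
    print("accuracy text only: ", accuracy(toy_predictions, "text_only"))
    print("image gain:         ", image_gain(toy_predictions))
    print("answer agreement:   ", answer_agreement(toy_predictions))
```

Under these assumptions, a small image gain together with high answer agreement would be consistent with a model answering mostly from the question text. Exact-match scoring is a simplification; free-text clinical answers would likely need a more tolerant metric.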
Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering
Brain Tumor MRI
Vision-Language Models
Clinical Benchmark
Neuro-oncology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Question Answering
Vision-Language Models
Brain Tumor MRI
Modality Collapse
Clinical Benchmark
Shiv Ghosh
Fung Institute for Engineering Leadership, University of California, Berkeley
Junayd Lateef
Fung Institute for Engineering Leadership, University of California, Berkeley
Chih-Hua Liu
Fung Institute for Engineering Leadership, University of California, Berkeley
Yannan Yu
Stanford University
Neurology, stroke, artificial intelligence
Andreas M. Rauschecker
Department of Radiology, University of California, San Francisco
Madhumita Sushil
Division of Clinical Informatics and Digital Transformation, Department of Neurological Surgery, University of California, San Francisco