UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

📅 2026-06-03
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limited evidence-grounded reasoning capabilities of current vision-language models on ultra-resolution images and the absence of fine-grained evaluation protocols for their reasoning processes. To this end, the authors propose UltraVR, a diagnostic visual question answering benchmark spanning four domains: CCTV surveillance, remote sensing, whole-slide image pathology, and industrial anomaly detection. Each instance carries a structured chain-of-thought annotation comprising step-level questions, intermediate answers, and reasoning labels, enabling diagnostic analysis across the full pipeline, from evidence grounding and local perception through quantification and evidence integration to decision inference. Experimental results show that frontier models remain far from reliable on ultra-resolution reasoning, with errors concentrated in the first two stages, evidence grounding and local perception; downstream inference often recovers when intermediate visual facts are explicitly provided.
📝 Abstract
Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
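The annotation structure described in the abstract (a QA pair plus a chain of step-level questions, intermediate answers, and reasoning labels) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the class names, field names, label spellings, the exact-match scoring rule, and the toy instance are hypothetical and are not taken from the UltraVR release.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The five reasoning labels named in the abstract.
# The identifier spellings are assumptions, not the benchmark's own.
STAGES = (
    "evidence_grounding",
    "local_perception",
    "quantification",
    "evidence_integration",
    "decision_inference",
)


@dataclass
class Step:
    question: str  # step-level question
    answer: str    # intermediate ground-truth answer
    label: str     # one of STAGES


@dataclass
class Instance:
    domain: str    # e.g. "CCTV", "RS", "WSI", "AD"
    question: str  # final question
    answer: str    # final ground-truth answer
    steps: list[Step] = field(default_factory=list)


def per_stage_accuracy(instances, predictions):
    """Accuracy per reasoning label.

    predictions[i][j] is the model's answer to step j of instance i.
    Scoring is case-insensitive exact match (an assumed rule).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for inst, preds in zip(instances, predictions):
        for step, pred in zip(inst.steps, preds):
            totals[step.label] += 1
            if pred.strip().lower() == step.answer.strip().lower():
                hits[step.label] += 1
    # Report only stages that occur in the data, in pipeline order.
    return {s: hits[s] / totals[s] for s in STAGES if totals[s]}


# Invented toy instance, for illustration only.
toy = Instance(
    domain="RS",
    question="Which harbour holds more ships?",
    answer="north",
    steps=[
        Step("Where is the north harbour?", "top-left", "evidence_grounding"),
        Step("How many ships are in the north harbour?", "7", "quantification"),
        Step("How many ships are in the south harbour?", "4", "quantification"),
        Step("Which harbour holds more ships?", "north", "decision_inference"),
    ],
)
scores = per_stage_accuracy([toy], [["top-right", "7", "5", "north"]])
print(scores)
# {'evidence_grounding': 0.0, 'quantification': 0.5, 'decision_inference': 1.0}
```

Grouping step-level correctness by label is what makes a process-level diagnosis possible: a low score on an early stage with a high score on a later one points to a failure in acquiring evidence rather than in reasoning over it.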
Problem

Research questions and friction points this paper is trying to address.

ultra-resolution images
evidence-grounded reasoning
visual question answering
vision-language models
diagnostic benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

ultra-resolution
evidence-grounded reasoning
structured chain-of-thought
visual question answering
diagnostic benchmark
Gexin Huang
University of British Columbia; SCUT; SYSU
Machine Learning, Deep Learning, Bayesian Statistics, Multi-modal Learning, Electromagnetic Source
Yanting Yang
University of British Columbia, Vancouver, BC, Canada; Vector Institute, Toronto, ON, Canada
Myeongkyun Kang
University of British Columbia, Vancouver, BC, Canada; Vector Institute, Toronto, ON, Canada
Beidi Zhao
University of British Columbia, Vancouver, BC, Canada; Vector Institute, Toronto, ON, Canada
Jun Zhou
The Hong Kong Polytechnic University
Computer vision, 6D pose estimation, Surgical Scene Understanding
Chen Zhou
University of British Columbia, Vancouver, BC, Canada; BC Cancer Agency
Gang Wang
University of British Columbia, Vancouver, BC, Canada; BC Cancer Agency
Zu-hua Gao
University of British Columbia, Vancouver, BC, Canada; BC Cancer Agency
Xiaoxiao Li
Assistant Professor, UBC; Vector Institute; CIFAR AI Chair; Canada Research Chair
Deep Learning, Trustworthy AI, AI for Healthcare