Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the unreliable performance of current vision-language models on metric distance queries, where cross-view consistency is often mistaken for evidence of geometric understanding. The authors propose ViewDiag, an evaluation framework that distinguishes decision collapse from representation collapse. Using a multi-view dataset built from Hypersim, ScanNet, and KITTI360, and combining metric accuracy, distributional concentration, and latent feature probing, the study shows that leading models exhibit high cross-view consistency yet low metric accuracy. This gap indicates that spatial reasoning in these models relies predominantly on learned priors rather than visual evidence, which challenges the common assumption that cross-view consistency is a valid proxy for geometric understanding.

📝 Abstract

Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy.

These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data are available at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git.
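
To make the "consistent yet wrong" regime concrete, here is a minimal sketch of how stability and accuracy might be scored together for one object-pair track. It is not the ViewDiag implementation: the function names, the use of mean absolute relative error for accuracy, the coefficient of variation for stability, and both thresholds are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def track_diagnostics(predicted_m, true_m):
    """Summarise one object-pair track seen from several viewpoints.

    predicted_m: the model's distance estimate for each view, in metres.
    true_m: the ground-truth distance for each view, in metres.
    Both metrics below are illustrative stand-ins, not the paper's definitions.
    """
    if len(predicted_m) != len(true_m) or len(predicted_m) < 2:
        raise ValueError("need two equal-length lists with at least two views")
    # Accuracy: mean absolute relative error over the views.
    rel_error = mean(abs(p - t) / t for p, t in zip(predicted_m, true_m))
    # Stability: coefficient of variation of the predictions across views.
    pred_spread = pstdev(predicted_m) / mean(predicted_m)
    return {"rel_error": rel_error, "pred_spread": pred_spread}


def is_consistent_yet_wrong(diag, max_spread=0.05, min_error=0.25):
    """Flag tracks whose answers barely move across views but are far off.

    The thresholds are hypothetical and would need tuning on real data.
    """
    return diag["pred_spread"] <= max_spread and diag["rel_error"] >= min_error


# Hypothetical track: four views, true distance 3.5 m, model answers about 2 m.
stable_wrong = track_diagnostics([2.0, 2.0, 2.1, 2.0], [3.5, 3.5, 3.5, 3.5])
# Hypothetical track: same scene, model answers close to the truth.
stable_right = track_diagnostics([3.4, 3.5, 3.6, 3.5], [3.5, 3.5, 3.5, 3.5])
```

Under these assumptions the first track is flagged and the second is not, even though both look equally stable. That is why a consistency score on its own cannot separate grounded predictions from prior-driven ones.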

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language models

metric distance

viewpoint consistency

evidence insensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

ViewDiag

vision-language models

spatial reasoning