🤖 AI Summary
This work addresses the lack of egocentric 3D spatial reasoning across non-co-occurring frames in vision-language models (VLMs). To this end, we introduce Disjoint-3DQA — the first generative, non-co-occurring 3D spatial question-answering benchmark — built upon egocentric videos. It systematically evaluates models' ability to infer inter-frame object spatial relationships via trajectory prompting, bird's-eye-view projection, and an ablation with oracle 3D coordinates. Experiments on seven state-of-the-art VLMs reveal: (1) average accuracy is 28% below human performance; (2) accuracy falls sharply, from roughly 60% to 30%, as temporal separation increases; and (3) providing oracle 3D coordinates improves accuracy by 20%, underscoring the critical role of geometric priors. Disjoint-3DQA establishes a novel evaluation paradigm and diagnostic toolkit for temporal 3D scene understanding, enabling fine-grained analysis of spatiotemporal reasoning deficits in VLMs.
📝 Abstract
An embodied AI assistant operating on egocentric video must integrate spatial cues across time: for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluate seven state-of-the-art VLMs and find that models lag behind human performance by 28%, with accuracy declining steeply (from 60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs yields only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs: constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
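The oracle-coordinates ablation hints at why explicit geometry helps so much: once both objects have 3D positions in a shared egocentric frame, the spatial relation the question asks about reduces to simple arithmetic. The sketch below is illustrative only — the coordinate convention (x: right, z: forward, camera at the origin) and the `relative_direction` helper are our assumptions, not the benchmark's actual evaluation code.

```python
def relative_direction(pos_a, pos_b):
    """Classify where object A lies relative to object B in an assumed
    egocentric camera frame (x: right, y: up, z: forward).

    pos_a, pos_b: (x, y, z) coordinates, e.g. from an oracle 3D prompt.
    Returns one of: "right", "left", "in front", "behind".
    """
    dx = pos_a[0] - pos_b[0]  # lateral offset of A relative to B
    dz = pos_a[2] - pos_b[2]  # depth offset of A relative to B
    # Report the axis with the larger displacement, a common
    # simplification for coarse relational QA answers.
    if abs(dx) >= abs(dz):
        return "right" if dx > 0 else "left"
    return "in front" if dz > 0 else "behind"
```

With oracle coordinates the answer is a deterministic lookup like `relative_direction((2.0, 0.0, 1.0), (-1.0, 0.0, 1.0))`, which returns `"right"`; without them, the model must recover those positions from frames that never show both objects together, which is precisely the capability Disjoint-3DQA probes.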