Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of egocentric 3D spatial reasoning across non-co-occurring frames in vision-language models (VLMs). To this end, we introduce Disjoint-3DQA—the first generative 3D spatial question-answering benchmark in which the two queried objects never appear in the same frame—built upon egocentric videos. It systematically evaluates models' ability to infer cross-frame object spatial relationships, with diagnostic interventions including trajectory prompting, bird's-eye-view projection, and an ablation with oracle 3D coordinates. Experiments on seven state-of-the-art VLMs reveal that: (1) average accuracy trails human performance by 28%; (2) accuracy drops sharply, from roughly 60% to 30%, as the temporal separation between sightings grows; and (3) supplying oracle 3D coordinates improves accuracy by about 20%, underscoring the critical role of geometric priors. Disjoint-3DQA establishes a novel evaluation paradigm and diagnostic toolkit for temporal 3D scene understanding, enabling fine-grained analysis of spatiotemporal reasoning deficits in VLMs.
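As a concrete illustration of the benchmark's setup, the sketch below derives the ground-truth answer for a disjoint-frame question from a camera pose and oracle 3D object positions. This is a minimal reconstruction, not the authors' released code: the coordinate convention (camera z forward, x right) and the coarse left/right, front/behind relation scheme are assumptions.

```python
import numpy as np

def to_egocentric(point_world, cam_pose_world):
    """Map a world-frame 3D point into the camera (egocentric) frame.

    cam_pose_world: 4x4 camera-to-world matrix; invert to get world-to-camera.
    """
    world_to_cam = np.linalg.inv(cam_pose_world)
    p = world_to_cam @ np.append(point_world, 1.0)
    return p[:3]

def spatial_relation(pos_a, pos_b, cam_pose):
    """Coarse relation of object A w.r.t. object B, judged from the
    given egocentric viewpoint (assumed relation scheme)."""
    a = to_egocentric(pos_a, cam_pose)
    b = to_egocentric(pos_b, cam_pose)
    lr = "left of" if a[0] < b[0] else "right of"
    fb = "in front of" if a[2] < b[2] else "behind"
    return f"{lr} and {fb}"

# Hypothetical usage: object A glimpsed early in the video, object B much
# later; the relation is judged from the final viewpoint.
cam_pose = np.eye(4)
print(spatial_relation(np.array([-0.8, 0.0, 2.0]),
                       np.array([0.5, 0.0, 3.0]), cam_pose))
```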

📝 Abstract
An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (from 60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs: constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
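The oracle-coordinate ablation described in the abstract amounts to posing the same question with and without ground-truth 3D positions injected into the prompt. Below is a minimal sketch of how such paired prompts might be built; the prompt wording and the commented-out `query_vlm` call are hypothetical, not the paper's actual interface.

```python
def build_prompt(obj_a, obj_b, coords=None):
    """Build a disjoint-frame spatial question, optionally augmented with
    oracle 3D coordinates (the ablation condition)."""
    q = (f"You watched an egocentric video. The {obj_a} and the {obj_b} "
         f"never appear together in one frame. Where is the {obj_a} "
         f"relative to the {obj_b}?")
    if coords is not None:
        (xa, ya, za), (xb, yb, zb) = coords
        q += (f"\nOracle 3D positions (meters, camera frame): "
              f"{obj_a}=({xa:.2f}, {ya:.2f}, {za:.2f}), "
              f"{obj_b}=({xb:.2f}, {yb:.2f}, {zb:.2f}).")
    return q

baseline = build_prompt("mug", "laptop")
oracle = build_prompt("mug", "laptop",
                      coords=((-0.8, 0.0, 2.0), (0.5, 0.0, 3.0)))
# answers = [query_vlm(frames, p) for p in (baseline, oracle)]  # hypothetical API
```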
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning across disjoint (non-co-visible) frames
Assessing the performance gap between VLMs and humans
Identifying bottlenecks in constructing and maintaining 3D scene representations over time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disjoint-3DQA benchmark tests VLM spatial reasoning over object pairs never seen in the same frame
Trajectory prompting and bird's-eye-view projections yield only marginal improvements
Oracle 3D coordinates boost accuracy by about 20%, while performance degrades as the temporal gap grows (see the analysis sketch below)
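The temporal-gap finding referenced above can be reproduced with a simple binned analysis: group QA items by the frame separation between the two objects' sightings and compute per-bin accuracy. The sketch below assumes hypothetical result fields and bin edges, not the paper's exact protocol.

```python
from collections import defaultdict

def accuracy_by_gap(results, bin_edges=(0, 30, 90, 270, 810)):
    """results: iterable of dicts with 'frame_gap' (int) and 'correct' (bool).
    Returns accuracy per temporal-gap bin (bin = largest edge <= frame_gap)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        b = max(e for e in bin_edges if r["frame_gap"] >= e)
        totals[b] += 1
        hits[b] += int(r["correct"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Toy usage with made-up evaluation records.
demo = [{"frame_gap": 10, "correct": True},
        {"frame_gap": 40, "correct": True},
        {"frame_gap": 300, "correct": False}]
print(accuracy_by_gap(demo))  # e.g. {0: 1.0, 30: 1.0, 270: 0.0}
```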