Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the geometric ambiguity that limited viewpoints cause in spatial reasoning over egocentric videos, proposing a training-free, inference-time framework whose conclusions can be verified and revised. A multimodal large language model (MLLM) first forms a spatial hypothesis from the original video. A Geometry-to-Video pipeline then renders a complementary novel-view video from predicted 3D geometry, using an elevated, oblique perspective with scene-spanning coverage, and the MLLM verifies or revises its hypothesis against this new view, overcoming the limits of single-turn inference. Experiments on VSI-Bench and STI-Bench show that the approach substantially improves the spatial reasoning of open-source MLLMs, bringing them to a level that rivals proprietary state-of-the-art models.

📝 Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
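The two-phase control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ask_mllm`, `render_novel_view`, `ReReResult`, and the prompt wording are hypothetical names and choices introduced here, and the renderer is only a stand-in for the Geometry-to-Video pipeline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   ask_mllm(video, prompt)   -> answer text from an MLLM given a video
#   render_novel_view(video)  -> a synthesized novel-view video
AskFn = Callable[[str, str], str]
RenderFn = Callable[[str], str]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase
    final: str       # answer after the Re-reason Phase
    revised: bool    # True if the second look changed the answer


def reason_then_rereason(video: str, question: str,
                         ask_mllm: AskFn,
                         render_novel_view: RenderFn) -> ReReResult:
    # Reason Phase: form a spatial hypothesis from the original video.
    hypothesis = ask_mllm(video, question)

    # Synthesize a complementary view (stand-in for Geometry-to-Video).
    novel_video = render_novel_view(video)

    # Re-reason Phase: verify or revise the hypothesis on the new view.
    prompt = (f"{question}\nInitial answer: {hypothesis}\n"
              "Check this answer against the new viewpoint; "
              "keep it or revise it.")
    final = ask_mllm(novel_video, prompt)
    return ReReResult(hypothesis, final, revised=(final != hypothesis))
```

Because both phases go through the same video-in, text-out call, a sketch like this needs no change to the model itself, which mirrors the abstract's point about preserving the MLLM's native video interface.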

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

egocentric videos

geometric ambiguity

cross-view revisiting

novel-view synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial reasoning

cross-view revisiting

novel-view synthesis