EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing spatial intelligence methods face three key challenges: weak spatial consistency in 3D–2D fusion, insufficient viewpoint diversity, and non-auditable evidence chains. Moreover, multi-step reasoning frameworks lack global spatial awareness under strict token budgets, explicit alignment between 3D hypotheses and frame-level visual evidence, and spatially grounded rewards for reinforcement learning. To address these, we propose EagleVision, a two-stage spatial cognition framework. In the macro stage, a semantics-perspective-fusion determinantal point process (SPF-DPP) compresses long videos into a compact set of geometry- and semantics-aware keyframes. In the micro stage, the agent formulates pose queries on the Bird's-Eye View (BEV) plane, retrieves the nearest real frames, and is trained by reinforcement learning with a spatial grounding reward that scores consistency between predicted poses and observed views, enabling traceable verification. This yields the first BEV-grounded Chain-of-Thought paradigm and state-of-the-art performance among open-source vision-language models on VSI-Bench.

📝 Abstract
Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
Problem

Research questions and friction points this paper is trying to address.

Builds global spatial perception under strict token budgets while preserving viewpoint diversity.
Links 3D hypotheses to video frames for explicit verification.
Designs spatially grounded rewards for reinforcement learning in spatial tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SPF-DPP selects geometry- and semantics-aware keyframes under a fixed token budget.
BEV-grounded pose querying formalizes spatial Chain-of-Thought.
A spatial grounding reward trains the agent via reinforcement learning.
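The micro-stage reward can be sketched as follows: retrieve the real frame nearest to the predicted BEV pose and score their agreement, so the reward is high only when a hypothesized pose is backed by an actually observed view. The Gaussian kernel form, the (x, y, yaw) pose encoding, and the function signature are illustrative assumptions, not the paper's exact reward.

```python
import math

def spatial_grounding_reward(pred_pose, frame_poses, sigma=1.0):
    """Toy spatial grounding reward for a predicted BEV pose.

    pred_pose: (x, y, yaw) hypothesized on the BEV plane.
    frame_poses: list of (x, y, yaw) poses of real video frames.
    Returns a reward in (0, 1], peaking when the prediction matches
    an observed view.
    """
    px, py, pyaw = pred_pose
    # Nearest-frame retrieval on the BEV plane (position only).
    nearest = min(frame_poses, key=lambda f: (f[0] - px) ** 2 + (f[1] - py) ** 2)
    dist2 = (nearest[0] - px) ** 2 + (nearest[1] - py) ** 2
    # Wrap the yaw difference into [-pi, pi] before penalizing it.
    dyaw = (nearest[2] - pyaw + math.pi) % (2 * math.pi) - math.pi
    return math.exp(-(dist2 + dyaw ** 2) / (2 * sigma ** 2))
```

A pose that coincides with a real frame scores 1.0, and the reward decays smoothly with position and heading error, which keeps the signal dense enough for reinforcement learning.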
Jiaxu Wan
School of Aerospace, BUAA
Xu Wang
Independent Researcher
Mengwei Xie
Independent Researcher
Hang Zhang
Independent Researcher
Mu Xu
Independent Researcher
Yang Han
School of Software, BUAA
Hong Zhang
School of Aerospace, BUAA
Ding Yuan
Professor, University of Toronto
Yifan Yang
School of Aerospace, BUAA