SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Autonomous robots require 4D scene understanding that jointly encodes semantics, geometry, and temporal dynamics. Existing vision-language models (VLMs) lack 3D-temporal grounding, while geometric methods suffer from semantic sparsity. To address this, we propose a training- and backbone-agnostic paradigm for constructing 4D Scene Graphs (4DSG), introducing two key innovations: STEP, a Spatio-Temporal Tokenized Patch Encoding scheme, and SLAM-anchored spatial alignment, enabling the first seamless joint grounding of VLM semantics, point-cloud geometry, and temporal consistency. Our method integrates SAM2-based segmentation, HDBSCAN clustering, a lightweight SLAM backend, and incremental multimodal token fusion to build a queryable, unified world model. Evaluated on multiple benchmarks, it achieves state-of-the-art performance, significantly improving spatial grounding accuracy and cross-temporal scene understanding. This work provides structured 4D priors essential for open-world embodied reasoning.

๐Ÿ“ Abstract
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
Problem

Research questions and friction points this paper is trying to address.

Integrates semantic priors with 3D geometry and temporal dynamics for scene understanding.
Unifies multimodal data into a queryable 4D scene graph for embodied reasoning.
Enables spatially grounded inference in dynamic environments for autonomous robotics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates VLM semantics with point cloud geometry and temporal consistency.
Uses HDBSCAN clustering and SAM2 segmentation for object-level proposals.
Builds a queryable 4D Scene Graph with Spatio-Temporal Tokenized Patch Encoding.
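As a rough illustration of what a queryable 4D Scene Graph with timestamped multimodal tokens might look like, here is a minimal sketch. The token fields (semantic label, SLAM-anchored centroid, timestamp) and the graph API are hypothetical stand-ins for the paper's STEP tokens and 4DSG, not its actual design.

```python
# Sketch: incremental integration of timestamped tokens into a minimal
# "4D scene graph". Field names and the query API are illustrative only.
from dataclasses import dataclass, field


@dataclass
class StepToken:
    label: str            # semantic attribute (e.g. from a VLM)
    centroid: tuple       # SLAM-anchored 3D position
    timestamp: float      # observation time


@dataclass
class SceneNode:
    tokens: list = field(default_factory=list)  # per-object observation history


class SceneGraph4D:
    def __init__(self):
        self.nodes = {}

    def integrate(self, obj_id, token):
        # Incremental fusion: append the new observation to the node's history.
        self.nodes.setdefault(obj_id, SceneNode()).tokens.append(token)

    def query(self, obj_id, at_time=None):
        # Temporal query: most recent token, or most recent at/before `at_time`.
        history = self.nodes[obj_id].tokens
        if at_time is not None:
            history = [t for t in history if t.timestamp <= at_time]
        return max(history, key=lambda t: t.timestamp)


graph = SceneGraph4D()
graph.integrate("chair_1", StepToken("chair", (1.0, 0.2, 0.0), 0.0))
graph.integrate("chair_1", StepToken("chair", (1.5, 0.2, 0.0), 5.0))  # chair moved
print(graph.query("chair_1", at_time=2.0).centroid)  # (1.0, 0.2, 0.0)
```

Keeping the full token history per node, rather than overwriting state, is what makes cross-temporal queries ("where was the chair before?") answerable, which is the role the 4DSG plays for downstream VLM reasoning.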
Tin Stribor Sohn
Karlsruhe Institute of Technology
Maximilian Dillitzer
Esslingen University of Applied Sciences
Jason J. Corso
University of Michigan
Eric Sax
Karlsruhe Institute of Technology
Systems Engineering