🤖 AI Summary
This work addresses the joint spatiotemporal modeling of dynamic real-world environments for embodied agents. To this end, we introduce STRIDE, the first Spatio-Temporal Road Image Dataset for Exploration, built from 360° panoramic imagery to capture the coupled spatial and temporal evolution of road scenes. We propose a spatiotemporally coupled, graph-structured road observation representation that unifies multi-view, multi-coordinate-system, and action-space observations. Furthermore, we design TARDIS, a Transformer-based architecture that enables instruction-conditioned, unified spatial-temporal autoregressive world modeling. Evaluated on controllable image synthesis, instruction following, autonomous navigation, and georeferencing, our approach achieves state-of-the-art performance, significantly enhancing embodied agents' spatiotemporal understanding of physical environments and their capacity for grounded, physics-aware interaction.
📝 Abstract
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges because they change dynamically across both space and, crucially, time. To capture these coupled dynamics, we introduce the Spatio-Temporal Road Image Dataset for Exploration (STRIDE), which restructures 360-degree panoramic imagery into richly interconnected observation, state, and action nodes. Leveraging this structure, we can simultaneously model the relationships between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model trained on STRIDE that integrates spatial and temporal dynamics in a unified autoregressive framework. We demonstrate robust performance across a range of agentic tasks, including controllable photorealistic image synthesis, instruction following, autonomous control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents, capable of understanding and manipulating the spatial and temporal aspects of their physical environments, with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.
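To make the graph-structured representation concrete, the following is a minimal illustrative sketch of how 360-degree panoramas, poses, and movement commands could be organized as interconnected observation, state, and action nodes. All class and field names here are hypothetical, chosen for illustration; they are not the released STRIDE schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a STRIDE-style road graph; field names are illustrative.

@dataclass
class ObservationNode:
    """An egocentric 360-degree panorama captured at one place and time."""
    image_path: str

@dataclass
class StateNode:
    """Agent pose: positional coordinates plus a capture timestamp."""
    lat: float
    lon: float
    heading_deg: float
    timestamp: float

@dataclass
class ActionNode:
    """A movement command linking two states (turn, then advance)."""
    turn_deg: float
    distance_m: float

@dataclass
class RoadGraph:
    """Observation, state, and action nodes joined by spatial and temporal edges."""
    observations: list = field(default_factory=list)
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    # Spatial edges: (state_i, state_j) pairs connected by an action.
    spatial_edges: list = field(default_factory=list)
    # Temporal edges: the same location observed again at a later time.
    temporal_edges: list = field(default_factory=list)

# Tiny usage example: two poses joined by one movement command.
g = RoadGraph()
g.observations.append(ObservationNode("pano_000.jpg"))
g.states.append(StateNode(37.7749, -122.4194, 90.0, 0.0))
g.states.append(StateNode(37.7749, -122.4184, 90.0, 5.0))
g.actions.append(ActionNode(turn_deg=0.0, distance_m=80.0))
g.spatial_edges.append((0, 1))
```

Under this kind of structure, a sequence model can consume interleaved observation, state, and action tokens, which is what makes a unified autoregressive treatment of space and time possible.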