🤖 AI Summary
This work addresses the joint spatiotemporal modeling of dynamic real-world environments for embodied agents. To this end, we introduce STRIDE, the first Spatio-Temporal Road Image Dataset for Exploration, built from 360° panoramic imagery to capture the coupled spatial and temporal evolution of road scenes. We propose a spatiotemporally coupled, graph-structured road observation representation that unifies multi-view, multi-coordinate-system, and action-space observations. Furthermore, we design TARDIS, a Transformer-based architecture that enables instruction-conditioned, unified spatial-temporal autoregressive world modeling. Evaluated on controllable image synthesis, instruction following, autonomous navigation, and georeferencing, our approach achieves state-of-the-art performance, significantly enhancing embodied agents' spatiotemporal understanding of physical environments and their capacity for grounded, physics-aware interaction.
📝 Abstract
World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges because they change dynamically across both space and, crucially, time. To capture these coupled dynamics, we introduce the Spatio-Temporal Road Image Dataset for Exploration (STRIDE), which restructures 360-degree panoramic imagery into richly interconnected observation, state, and action nodes. Leveraging this structure, we can simultaneously model the relationships between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model trained on STRIDE that integrates spatial and temporal dynamics in a unified autoregressive framework. We demonstrate robust performance across a range of agentic tasks, including controllable photorealistic image synthesis, instruction following, autonomous control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents, capable of understanding and manipulating the spatial and temporal aspects of their physical environments, with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.
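To make the graph-structured representation concrete, the following is a minimal illustrative sketch of how 360-degree panoramas, poses, and movement commands could be organized as interconnected observation, state, and action nodes. All class and field names here are hypothetical, chosen for illustration; they are not the released STRIDE schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a STRIDE-style road graph; field names are illustrative.

@dataclass
class ObservationNode:
    """An egocentric 360-degree panorama captured at one place and time."""
    image_path: str

@dataclass
class StateNode:
    """Agent pose: positional coordinates plus a capture timestamp."""
    lat: float
    lon: float
    heading_deg: float
    timestamp: float

@dataclass
class ActionNode:
    """A movement command linking two states (turn, then advance)."""
    turn_deg: float
    distance_m: float

@dataclass
class RoadGraph:
    """Observation, state, and action nodes joined by spatial and temporal edges."""
    observations: list = field(default_factory=list)
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    # Spatial edges: (state_i, state_j) pairs connected by an action.
    spatial_edges: list = field(default_factory=list)
    # Temporal edges: the same location observed again at a later time.
    temporal_edges: list = field(default_factory=list)

# Tiny usage example: two poses joined by one movement command.
g = RoadGraph()
g.observations.append(ObservationNode("pano_000.jpg"))
g.states.append(StateNode(37.7749, -122.4194, 90.0, 0.0))
g.states.append(StateNode(37.7749, -122.4184, 90.0, 5.0))
g.actions.append(ActionNode(turn_deg=0.0, distance_m=80.0))
g.spatial_edges.append((0, 1))
```

Under this kind of structure, a sequence model can consume interleaved observation, state, and action tokens, which is what makes a unified autoregressive treatment of space and time possible.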