LookOut: Real-World Humanoid Egocentric Navigation

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses long-horizon prediction of 6D head pose (3D translation and 3D rotation) in first-person videos, modeling human active information-seeking navigation behavior. We propose a spatiotemporal reasoning framework that temporally aggregates 3D latent features, jointly encoding static environmental geometry and dynamic semantic constraints to generate collision-free, physically plausible pose sequences. To bridge the lack of real-world first-person navigation data, we introduce the Aria Navigation Dataset (AND), the first large-scale egocentric navigation dataset captured with Project Aria smart glasses in diverse real-world settings. Experiments demonstrate that our model generalizes effectively to unseen environments, producing human-like navigation strategies such as deceleration, obstacle avoidance, and traffic-aware scanning, and outperforming prior methods by significant margins. The approach shows strong practical potential for humanoid robotics and immersive VR/AR interaction systems.

📝 Abstract
The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.
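To make the prediction task concrete: each head pose combines a 3D translation with a 3D rotation, and the model must emit a sequence of such poses for future timesteps. The sketch below is purely illustrative and is not the paper's method; the flat axis-angle rotation parameterization and the constant-velocity extrapolation baseline are assumptions chosen for simplicity.

```python
import numpy as np

# A 6D head pose: [tx, ty, tz, rx, ry, rz], i.e. 3D translation plus a
# 3D rotation kept here as an axis-angle vector for simplicity. The
# paper's exact pose parameterization is not specified in this summary.

def constant_velocity_baseline(poses, horizon):
    """Extrapolate future 6D poses from the last observed step difference.

    poses:   (T, 6) array of observed head poses, T >= 2.
    horizon: number of future steps to predict.
    Returns an (horizon, 6) array of predicted poses.
    """
    poses = np.asarray(poses, dtype=float)
    velocity = poses[-1] - poses[-2]            # per-step pose change
    steps = np.arange(1, horizon + 1)[:, None]  # 1, 2, ..., horizon
    return poses[-1] + steps * velocity         # linear extrapolation

# Two observed poses moving 0.1 m forward per step, rotation unchanged.
observed = np.array([[0.0, 0, 0, 0, 0, 0],
                     [0.1, 0, 0, 0, 0, 0]])
future = constant_velocity_baseline(observed, horizon=3)
# future[:, 0] continues at 0.1 m per step: 0.2, 0.3, 0.4
```

A baseline of this kind is exactly what learned models such as the one described above must beat: it ignores the geometric and semantic scene constraints, so it walks straight through obstacles instead of rerouting or slowing down.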
Problem

Research questions and friction points this paper is trying to address.

Predicting future 6D head poses from egocentric video
Modeling geometric and semantic constraints for navigation
Learning human-like navigation behaviors in real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts 6D head poses from egocentric video
Uses 3D latent features for geometric constraints
Collects real-world navigation data via Aria glasses