NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

📅 2026-06-11

🤖 AI Summary

This work addresses the challenge of turning visual predictions into closed-loop control for goal-conditioned visual navigation under partial observability. It proposes NavWAM, a diffusion-transformer policy that models visual prediction and action generation jointly within a navigation world model by representing future observations, goal-progress values, and action chunks in a shared latent sequence. Because future prediction is learned together with the action and value targets that determine closed-loop behavior, the predicted futures are directly usable for robot control. NavWAM runs in its default policy mode without an external planner or CEM-style action search, and in the authors' evaluations it improves over planning-based world-model baselines on both offline benchmarks and closed-loop real-robot deployment.
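To make the idea of a shared latent sequence concrete, here is a minimal sketch of how three kinds of targets could be placed in one token sequence and separated again afterwards. It is an illustration only: the names `SharedSequence`, `pack`, and `unpack`, the token ordering, and the toy two-dimensional tokens are all assumptions, since the summary does not specify NavWAM's actual layout or tokenizer.

```python
from dataclasses import dataclass
from typing import List

Token = List[float]  # one latent vector; the width is arbitrary in this sketch


@dataclass
class SharedSequence:
    """Hypothetical container: one flat token list plus a kind tag per token."""
    tokens: List[Token]
    kinds: List[str]  # "obs", "value", or "action"


def pack(obs: List[Token], value: Token, actions: List[Token]) -> SharedSequence:
    """Concatenate observation, value, and action tokens into one sequence.

    The order obs -> value -> action is an assumption made for illustration.
    """
    tokens = [*obs, value, *actions]
    kinds = ["obs"] * len(obs) + ["value"] + ["action"] * len(actions)
    return SharedSequence(tokens, kinds)


def unpack(seq: SharedSequence, kind: str) -> List[Token]:
    """Recover all tokens of one kind, preserving their order."""
    return [tok for tok, k in zip(seq.tokens, seq.kinds) if k == kind]


# Toy usage: two future-observation tokens, one value token, three action tokens.
seq = pack(
    obs=[[0.1, 0.2], [0.3, 0.4]],
    value=[0.5, 0.0],
    actions=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
action_chunk = unpack(seq, "action")
```

A single model that denoises or predicts the whole sequence sees all three target types at once, which is the property the summary attributes to the shared representation.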

📝 Abstract

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/
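For readers unfamiliar with the "CEM-style action search" that the abstract contrasts against, the following is a generic cross-entropy method sketch on a toy 2D point robot. It is not the planner used by the baselines and not part of NavWAM: the additive dynamics in `rollout_cost`, the function name `cem_plan`, and every hyperparameter are assumptions chosen to keep the example small.

```python
import math
import random
from typing import List, Tuple

Action = Tuple[float, float]


def rollout_cost(start: Action, goal: Action, actions: List[Action]) -> float:
    """Toy stand-in for a world model: add up the displacements, then
    return the distance from the final position to the goal."""
    x, y = start
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return math.hypot(goal[0] - x, goal[1] - y)


def cem_plan(start: Action, goal: Action, horizon: int = 4, samples: int = 64,
             elites: int = 8, iters: int = 5, seed: int = 0) -> List[Action]:
    """Generic CEM: sample action sequences, keep the best, refit a Gaussian."""
    rng = random.Random(seed)
    mean = [[0.0, 0.0] for _ in range(horizon)]
    std = [[1.0, 1.0] for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current per-step Gaussian.
        candidates = [
            [(rng.gauss(mean[t][0], std[t][0]), rng.gauss(mean[t][1], std[t][1]))
             for t in range(horizon)]
            for _ in range(samples)
        ]
        # Score each candidate with the (toy) model and keep the lowest-cost ones.
        candidates.sort(key=lambda seq: rollout_cost(start, goal, seq))
        best = candidates[:elites]
        # Refit mean and standard deviation to the elite set.
        for t in range(horizon):
            for d in range(2):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = math.sqrt(var) + 1e-3  # floor keeps sampling alive
    return [(m[0], m[1]) for m in mean]


start, goal = (0.0, 0.0), (2.0, 1.0)
plan = cem_plan(start, goal)
```

Each control step in this scheme costs `samples * iters` model rollouts. A policy that emits an action chunk directly, as the abstract describes for NavWAM's default mode, avoids that search loop.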

Problem

Research questions and friction points this paper is trying to address.

goal-conditioned visual navigation

navigation world models

visual foresight

closed-loop control

partial observability

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model

goal-conditioned navigation

diffusion transformer