MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

📅 2025-08-04
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Monocular vision-and-language navigation (VLN) suffers from limited spatial reasoning due to the absence of panoramic perception. To address this, we propose MonoDream, a lightweight vision-language-action (VLA) framework. Our method introduces: (1) a Unified Navigation Representation that maps monocular images and language instructions into a shared semantic embedding space; (2) Latent Panoramic Dreaming tasks that supervise this representation by predicting latent panoramic RGB and depth features, enabling implicit modeling of global scene layout and future state evolution; and (3) joint cross-modal alignment with a lightweight action decoder for efficient end-to-end navigation policy learning. Evaluated on standard VLN benchmarks such as R2R and CVDN, our approach achieves state-of-the-art monocular navigation performance, attaining over 90% of the success rate of panoramic RGB-D baselines. This substantially narrows the modality gap and establishes a scalable, resource-efficient paradigm for embodied navigation in constrained environments.
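
Read as a training recipe, components (2) and (3) amount to a joint objective in which action decoding is optimized alongside regression onto panoramic latents. The following is a hedged reconstruction from the summary and abstract wording; the loss symbols and the weight λ are illustrative and not the paper's notation:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda \left(
      \mathcal{L}^{\,t}_{\text{pano-RGB}} + \mathcal{L}^{\,t}_{\text{pano-depth}}
    + \mathcal{L}^{\,t+1}_{\text{pano-RGB}} + \mathcal{L}^{\,t+1}_{\text{pano-depth}}
    \right)
```

Here the panoramic terms would compare latents predicted from the monocular representation against features of panoramic RGB and depth observations available only during training, and λ balances dreaming against action prediction.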

📝 Abstract
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
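
To make the abstract's components concrete, below is a minimal PyTorch sketch of how a monocular image and an instruction could be fused into a shared UNR, from which an action decoder and LPD heads branch off. The backbone choices, dimensions, and head names are assumptions for illustration, not MonoDream's actual architecture; the LPD targets would come from a separate panoramic encoder used only at training time.

```python
import torch
import torch.nn as nn


class MonoDreamSketch(nn.Module):
    """Hypothetical wiring of the components named in the abstract."""

    def __init__(self, d_model=512, vocab_size=30522, num_actions=6):
        super().__init__()
        # Monocular RGB encoder (placeholder CNN; the paper presumably uses a
        # pretrained vision backbone).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Instruction encoder (placeholder: token embedding + mean pooling).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Fusion into the shared Unified Navigation Representation (UNR).
        self.fuse = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        # Lightweight action decoder over the UNR.
        self.action_head = nn.Linear(d_model, num_actions)
        # Latent Panoramic Dreaming (LPD) heads: regress latent features of
        # panoramic RGB / depth at the current and next step from the UNR.
        self.lpd_heads = nn.ModuleDict({
            name: nn.Linear(d_model, d_model)
            for name in ("pano_rgb_t", "pano_depth_t",
                         "pano_rgb_t1", "pano_depth_t1")
        })

    def forward(self, image, instruction_ids):
        img_feat = self.image_encoder(image)                      # (B, d)
        txt_feat = self.text_embed(instruction_ids).mean(dim=1)   # (B, d)
        unr = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))  # shared UNR
        action_logits = self.action_head(unr)
        lpd_preds = {name: head(unr) for name, head in self.lpd_heads.items()}
        return action_logits, lpd_preds


# Shape check with dummy inputs.
model = MonoDreamSketch()
image = torch.randn(2, 3, 224, 224)
instruction_ids = torch.randint(0, 30522, (2, 40))
action_logits, lpd_preds = model(image, instruction_ids)
print(action_logits.shape)              # torch.Size([2, 6])
print(lpd_preds["pano_rgb_t1"].shape)   # torch.Size([2, 512])
```

In this reading, the LPD heads exist only to shape the UNR during training; at inference time the monocular encoder, instruction encoder, and action head suffice, which is what keeps the deployed agent panorama- and depth-free.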
Problem

Research questions and friction points this paper is trying to address.

Enhance monocular VLN without costly panoramic RGB-D sensors
Align visual semantics and language intent for reliable navigation
Predict latent panoramic features from monocular input for planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular agents learn a Unified Navigation Representation (UNR)
Latent Panoramic Dreaming (LPD) predicts latent panoramic RGB and depth features from monocular input
UNR jointly aligns visual semantics with language-grounded action intent
Authors
Shuo Wang (Renmin University of China)
Yongcai Wang (Renmin University of China)
Wanting Li (Renmin University of China)
Yucheng Wang (ETH Zürich)
Maiyue Chen (Horizon Robotics)
Kaihui Wang (Horizon Robotics)
Zhizhong Su (Horizon Robotics)
Xudong Cai (Renmin University of China)
Yeying Jin (Tencent | National University of Singapore)
Deying Li (Renmin University of China)
Zhaoxin Fan (Renmin University of China)