🤖 AI Summary
This work addresses two limitations of vision-and-language navigation (VLN), data scarcity and insufficient simulation fidelity, that hinder generalization to complex, long-horizon tasks. The authors propose a unified VLN paradigm comprising the large-scale GN-Matrix dataset and a high-fidelity interactive simulator built on 3D Gaussian Splatting (3DGS). They introduce GN-BAE, an end-to-end navigation foundation model trained with supervised learning, then DAgger, then reinforcement learning, and use 3DGS-rendered bird’s-eye-view (BEV) representations as a compact memory to strengthen spatial reasoning in vision-language models. For evaluation, they release GN-Bench, described as the first BEV-based VLN benchmark, which incorporates dynamic 3DGS avatars for human-robot interaction. Experiments show that the approach outperforms state-of-the-art methods on both GN-Bench and VLN-CE across instruction following, goal-oriented navigation, and human following.
📝 Abstract
Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained the generalization and long-horizon capabilities of Vision-and-Language Navigation (VLN) systems. To address this, we curate diverse 3D scenes and develop an automated pipeline for generating large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first Bird's Eye View (BEV)-based benchmark, which incorporates dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop a navigation foundation model driven by reinforcement learning (RL), Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered BEV representations as compact memory, unlocking latent spatial reasoning in vision-language models (VLMs). Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.
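
The DAgger stage mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D corridor, the `expert_action` oracle, the nearest-neighbour learner in `fit`, and all constants are assumptions made up for illustration, and the later RL stage is not shown. The sketch only demonstrates the general DAgger loop: clone the expert, roll out the learner, ask the expert to label the states the learner actually visited, aggregate those labels, and refit.

```python
GOAL, SIZE, HORIZON = 5, 11, 20  # hypothetical corridor: cells 0..10, goal at cell 5


def expert_action(state):
    """Hypothetical expert oracle: step toward the goal cell."""
    return 1 if state < GOAL else -1


def step(state, action):
    """Move one cell left or right, clamped to the corridor."""
    return max(0, min(SIZE - 1, state + action))


def fit(dataset):
    """Toy learner: copy the label of the nearest labelled state."""
    table = dict(dataset)  # snapshot, so later aggregation does not alter this policy

    def policy(state):
        nearest = min(table, key=lambda s: (abs(s - state), s))
        return table[nearest]

    return policy


def rollout(policy, start):
    """Run a policy; return the non-goal states it visited and whether it reached the goal."""
    state, visited = start, []
    for _ in range(HORIZON):
        if state == GOAL:
            return visited, True
        visited.append(state)
        state = step(state, policy(state))
    return visited, state == GOAL


def success_rate(policy, starts):
    return sum(rollout(policy, s)[1] for s in starts) / len(starts)


def dagger(starts, iterations):
    # Supervised stage: behaviour cloning on a single expert demonstration from cell 0.
    dataset = {s: expert_action(s) for s in rollout(expert_action, 0)[0]}
    policy = fit(dataset)
    history = [success_rate(policy, starts)]
    for _ in range(iterations):
        # Roll out the learner, then label the states it reached with the expert.
        for start in starts:
            visited, _ = rollout(policy, start)
            for s in visited:
                dataset[s] = expert_action(s)  # aggregate into one growing dataset
        policy = fit(dataset)
        history.append(success_rate(policy, starts))
    return policy, history


if __name__ == "__main__":
    _, history = dagger(starts=[0, 8, 10], iterations=2)
    print(history)
```

In this toy setting the cloned policy has only seen cells to the left of the goal, so it walks the wrong way when started on the right. Labelling the states reached by the learner's own rollouts is what fills in that gap, which mirrors the abstract's description of breaking a narrow expert-centric distribution before exploration.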