🤖 AI Summary
Existing vision-and-language navigation (VLN) methods often predict waypoints in isolation, which can produce unreachable targets or a mismatch between high-level planning and low-level control. To address this, this work proposes a trajectory-centric waypoint paradigm that couples waypoint prediction with executable trajectory generation. Each candidate waypoint is anchored to a trajectory produced by a TSDF-guided diffusion policy that steers generation away from obstacles, so that waypoints are reachable and high-level decisions stay consistent with low-level execution. Experiments on the VLN-CE benchmark show that the proposed approach outperforms the baselines.
📝 Abstract
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that takes the associated trajectory as additional input for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.
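To make the idea of distance-field guidance concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy 2D signed distance field around a single disc obstacle in place of a TSDF fused from depth observations, and it applies a plain gradient-based refinement loop where a diffusion policy would add a guidance term inside each denoising step. All function names (`make_sdf`, `sdf_guidance`, `guided_refine`, `min_clearance`) and parameters (`margin`, `step_size`) are hypothetical.

```python
import numpy as np

def make_sdf(size=64, center=(32.0, 32.0), radius=8.0):
    """Toy signed distance field: positive in free space, negative inside a disc."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.hypot(xs - center[0], ys - center[1]) - radius

def _cells(traj, sdf):
    """Nearest grid cell (row, col) for each (x, y) trajectory point."""
    ix = np.clip(np.rint(traj[:, 0]).astype(int), 0, sdf.shape[1] - 1)
    iy = np.clip(np.rint(traj[:, 1]).astype(int), 0, sdf.shape[0] - 1)
    return iy, ix

def sdf_guidance(traj, sdf, margin=4.0):
    """Per-point push along the SDF gradient, active only where clearance < margin."""
    gy, gx = np.gradient(sdf)  # axis 0 is y (rows), axis 1 is x (columns)
    iy, ix = _cells(traj, sdf)
    grad = np.stack([gx[iy, ix], gy[iy, ix]], axis=1)
    weight = np.clip(margin - sdf[iy, ix], 0.0, None)[:, None]
    return weight * grad

def guided_refine(traj, sdf, steps=50, step_size=0.2, margin=4.0):
    """Apply the guidance term repeatedly, keeping the first and last points fixed."""
    traj = traj.copy()
    for _ in range(steps):
        push = sdf_guidance(traj, sdf, margin)
        push[0] = 0.0
        push[-1] = 0.0
        traj += step_size * push
    return traj

def min_clearance(traj, sdf):
    """Smallest signed distance along the trajectory (negative means collision)."""
    iy, ix = _cells(traj, sdf)
    return float(sdf[iy, ix].min())

# Usage: a straight candidate trajectory that cuts through the obstacle.
sdf = make_sdf()
line = np.stack([np.linspace(4.0, 60.0, 29), np.full(29, 30.0)], axis=1)
refined = guided_refine(line, sdf)
print("clearance before:", min_clearance(line, sdf))
print("clearance after: ", min_clearance(refined, sdf))
```

In this sketch the endpoint of the refined trajectory plays the role of the candidate waypoint: because the whole path has been pushed out of the obstacle, the waypoint comes with a route that reaches it, which is the property the abstract attributes to the Trajectory Waypoint paradigm.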