UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the reduced robustness of localization and planning in Vision-and-Language Navigation in Continuous Environments (VLN-CE), where agents face visual occlusions and unstructured, free-form paths, this paper introduces UnitedVLN, a generalizable 3D Gaussian Splatting (3DGS)-based pre-training paradigm. The method renders high-fidelity 360° visual images and dense semantic features in a unified framework, letting the agent explore future environments at both the appearance level and the semantic level rather than relying solely on current observations. Two schemes make this efficient: a search-then-query sampling strategy for neural primitives and a separate-then-united rendering mechanism that integrates appearance and semantic information for more robust navigation. Extensive experiments show that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

๐Ÿ“ Abstract
Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack either the high-level semantics or the intuitive appearance-level information crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3D Gaussian Splatting (3DGS)-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
Problem

Research questions and friction points this paper is trying to address.

VLN-CE agents may move to any unobstructed location, making them vulnerable to visual occlusions and blind spots.
RGB-based future prediction lacks high-level semantics, while feature-based prediction lacks intuitive appearance-level information.
Existing methods do not jointly exploit appearance and semantic cues for robust continuous navigation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizable 3DGS-based pre-training paradigm
Search-then-query sampling for neural primitives
Separate-then-united rendering for robust navigation
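The paper releases no code here; as a rough illustration of the alpha-compositing that 3DGS rendering rests on, and that a "united" pass over appearance and semantics would extend, below is a minimal per-ray sketch. The function name, toy values, and the choice to accumulate an RGB color and a semantic feature vector with the same blending weights are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def composite_ray(colors, feats, alphas):
    """Alpha-composite depth-sorted Gaussian contributions along one ray,
    accumulating RGB appearance and a semantic feature vector together.

    colors: (N, 3) RGB per Gaussian, feats: (N, D) semantic feature per
    Gaussian, alphas: (N,) opacities in [0, 1], all in near-to-far order.
    (Hypothetical sketch; the real method's rendering details may differ.)
    """
    transmittance = 1.0
    rgb = np.zeros(3)
    feat = np.zeros(feats.shape[1])
    for c, f, a in zip(colors, feats, alphas):
        w = transmittance * a        # blending weight T_i * alpha_i
        rgb += w * c                 # appearance branch
        feat += w * f                # semantic branch, same weights
        transmittance *= 1.0 - a     # light remaining behind this Gaussian
        if transmittance < 1e-4:     # early termination once opaque
            break
    return rgb, feat

# toy example: two Gaussians along one ray
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
alphas = np.array([0.6, 0.5])
rgb, feat = composite_ray(colors, feats, alphas)
# rgb -> [0.6, 0.0, 0.2], feat -> [0.6, 0.2]
```

The second Gaussian contributes with weight 0.4 × 0.5 = 0.2 because the first one already absorbs 60% of the ray, which is why occluded primitives influence both the rendered image and the rendered semantics less.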
Computer Vision · Robotics

Guangzhao Dai (Nanjing University of Science and Technology)
Jian Zhao (Northwest Polytechnical University)
Yuantao Chen (The Chinese University of Hong Kong, Shenzhen)
Yusen Qin (Tsinghua University)
Hao Zhao (Tsinghua University)
Guosen Xie (Nanjing University of Science and Technology)
Yazhou Yao (Nanjing University of Science and Technology)
Xiangbo Shu (Nanjing University of Science and Technology)
Xuelong Li (Northwest Polytechnical University)