ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments

📅 2025-04-14
🤖 AI Summary
Vision-and-Language Navigation in Continuous Environments (VLN-CE) suffers from two perceptual degradation issues: (1) heterogeneous visual memory and weakened global spatial coherence due to the absence of predefined observation points, and (2) structural noise induced by cumulative errors in 3D reconstruction. To address these, the paper proposes ST-Booster, an iterative spatiotemporal booster that coordinates multi-granularity perception with instruction-aware reasoning. It comprises three modules: Hierarchical SpatioTemporal Encoding (HSTE), which maintains a dual memory of topological graphs and grid maps; geometry-aware Multi-Granularity Aligned Fusion (MGAF), which aligns both maps with the instruction; and Value-Guided Waypoint Generation (VGWG), which produces Guided Attention Heatmaps (GAHs) for waypoint selection. The fused representations are iteratively refined through pretraining tasks. Evaluated under complex perturbations, the approach is reported to outperform state-of-the-art methods in path success rate and navigation efficiency, with the largest benefits in long-horizon instruction comprehension and robustness against structural noise.

📝 Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules: Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures short-term local details via grid maps. MGAF aligns these dual-map representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses demonstrate that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.
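
To make the dual-memory idea behind HSTE more concrete, the following is a minimal sketch of a long-term topological graph paired with a short-term, agent-centred grid map. It is an illustration only, not the paper's implementation: the names `TopoMemory` and `local_grid`, the merge radius, the grid size, and the use of plain 2D positions instead of learned visual features are all assumptions made here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TopoMemory:
    """Long-term memory as a graph of observed positions (hypothetical layout)."""
    nodes: dict = field(default_factory=dict)  # node id -> (x, y)
    edges: set = field(default_factory=set)    # undirected pairs of node ids

    def add_node(self, pos, merge_radius=0.5):
        # Reuse an existing node when the new position lies within merge_radius;
        # this stands in for the missing predefined observation points.
        for node_id, other in self.nodes.items():
            if math.dist(pos, other) <= merge_radius:
                return node_id
        node_id = len(self.nodes)
        self.nodes[node_id] = pos
        return node_id

    def connect(self, a, b):
        # Store each undirected edge once, with the smaller id first.
        if a != b:
            self.edges.add((min(a, b), max(a, b)))


def local_grid(agent_pos, points, size=5, cell=1.0):
    """Short-term memory: count observed points per cell of an agent-centred grid."""
    grid = [[0] * size for _ in range(size)]
    half = size // 2
    for x, y in points:
        col = math.floor((x - agent_pos[0]) / cell + 0.5) + half
        row = math.floor((y - agent_pos[1]) / cell + 0.5) + half
        if 0 <= row < size and 0 <= col < size:  # drop points outside the window
            grid[row][col] += 1
    return grid
```

In this sketch the graph grows over the whole episode while the grid is rebuilt around the agent at each step, which mirrors the global versus local split described in the abstract.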
Problem

Research questions and friction points this paper is trying to address.

Enhance navigation in continuous spaces using language instructions
Address heterogeneous visual memories and spatial correlation issues
Mitigate structural noise from 3D scene reconstruction errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical SpatioTemporal Encoding for global and local perception
Multi-Granularity Aligned Fusion for instruction-aware reasoning
ValueGuided Waypoint Generation with Guided Attention Heatmaps
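
As a rough illustration of heatmap-guided waypoint selection, the sketch below normalizes a grid of instruction-relevance scores into a heatmap and picks the candidate waypoint whose cell has the highest value. It is an assumption-laden stand-in rather than the paper's VGWG: the function names `softmax_heatmap` and `select_waypoint`, the softmax normalization, and the argmax selection rule are hypothetical choices, and the relevance scores would in practice come from a learned model.

```python
import math


def softmax_heatmap(scores):
    """Normalize a 2D grid of relevance scores into a heatmap that sums to 1."""
    peak = max(s for row in scores for s in row)  # subtract max for stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]


def select_waypoint(heatmap, candidates):
    """Return the candidate (row, col) cell with the highest heatmap value."""
    return max(candidates, key=lambda rc: heatmap[rc[0]][rc[1]])


# Example: four grid cells scored for relevance to the current instruction.
scores = [[0.0, 1.0],
          [2.0, 0.5]]
heatmap = softmax_heatmap(scores)
best = select_waypoint(heatmap, [(0, 1), (1, 0), (1, 1)])
```

Restricting the choice to a candidate list keeps the selection on navigable cells, while the heatmap only ranks them.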
Lu Yue
Department of Advanced Manufacturing and Robotics, and the State Key Laboratory of Turbulence and Complex Systems, College of Engineering, Peking University, Beijing, 100871, China; Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China; Tianjin Artificial Intelligence Innovation Center, Tianjin 300450, China
Dongliang Zhou
Department of Computer Science, Harbin Institute of Technology, Shenzhen, Xili University Town, Shenzhen 518055, China
Liang Xie
Wuhan University of Technology
Time Series Forecasting, Cross-modal Learning
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences, Beijing 100071, China; Tianjin Artificial Intelligence Innovation Center, Tianjin 300450, China
Feitian Zhang
Associate Professor, Peking University
Underwater Vehicles, Aerial Vehicles, Bioinspired Robotics, Control Systems, Artificial Intelligence