Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of long-horizon task failure in vision-and-language navigation within continuous environments, which stems from insufficient scene understanding, inefficient planning, and weak decision-making. To overcome these limitations, the authors propose a graph-driven end-to-end navigation framework that constructs a dynamic hierarchical semantic scene graph. This framework integrates an optimal transport–based (Kantorovich dual) target selection mechanism with a spectral graph theory–guided graph-aware reinforcement learning policy, enabling seamless unification from high-level semantic reasoning to low-level action execution. Evaluated on multiple VLN-CE benchmarks, the method achieves state-of-the-art performance, significantly improving both navigation success rates and generalization to unseen environments.
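For readers unfamiliar with the Kantorovich dual mentioned above, the standard discrete form can be stated as follows. The symbols are assumptions made for illustration and are not taken from the paper: $a$ and $b$ are source and target histograms, $C_{ij}$ is the cost of matching source item $i$ to candidate target $j$, and $P$ is the transport plan.

```latex
% Primal (Kantorovich) problem: cheapest plan with prescribed marginals
\min_{P \ge 0} \; \sum_{i,j} C_{ij} P_{ij}
\quad \text{s.t.} \quad \sum_{j} P_{ij} = a_i, \qquad \sum_{i} P_{ij} = b_j .

% Dual problem: potentials f, g bounded by the cost
\max_{f,\, g} \; \sum_{i} a_i f_i + \sum_{j} b_j g_j
\quad \text{s.t.} \quad f_i + g_j \le C_{ij} \;\; \text{for all } i, j .
```

Because both problems are finite linear programs with feasible solutions, their optimal values coincide. How HSAN defines the histograms and the cost is not specified in this summary.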

📝 Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
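The idea of selecting long-term goals by trading off semantic relevance against spatial accessibility can be illustrated with a minimal sketch. This is not the authors' planner: it swaps in entropic-regularized optimal transport solved with Sinkhorn iterations, and every name here (`sinkhorn`, `select_goals`, `similarity`, `distance`, `alpha`, `eps`) is a hypothetical choice made for illustration. It assumes instruction landmarks and candidate graph nodes both carry uniform mass, and that the cost blends dissimilarity with normalized path length.

```python
import math


def sinkhorn(cost, a, b, eps=0.2, iters=200):
    """Entropic-regularized OT plan between histograms a and b (assumed setup)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low cost -> large kernel entry.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def select_goals(similarity, distance, alpha=0.7, eps=0.2):
    """Match each instruction landmark to a candidate node.

    similarity[i][j]: hypothetical relevance in [0, 1] of node j to landmark i.
    distance[j]: hypothetical path length from the agent to node j.
    alpha: weight on semantic cost versus travel cost (illustrative value).
    """
    n, m = len(similarity), len(distance)
    d_max = max(distance) or 1.0  # avoid dividing by zero
    cost = [
        [
            alpha * (1.0 - similarity[i][j])
            + (1.0 - alpha) * distance[j] / d_max
            for j in range(m)
        ]
        for i in range(n)
    ]
    a = [1.0 / n] * n  # uniform mass over landmarks (assumption)
    b = [1.0 / m] * m  # uniform mass over candidate nodes (assumption)
    plan = sinkhorn(cost, a, b, eps)
    # Each landmark's goal is the node receiving most of its mass.
    goals = [max(range(m), key=lambda j: plan[i][j]) for i in range(n)]
    return goals, plan


if __name__ == "__main__":
    sim = [[0.9, 0.2, 0.1], [0.1, 0.3, 0.8]]
    dist = [2.0, 5.0, 4.0]
    goals, plan = select_goals(sim, dist)
    print(goals)
```

Solving the matching jointly, rather than taking a per-landmark argmax over similarity, lets the marginal constraints discourage several landmarks from collapsing onto one node. Smaller `eps` values move the plan closer to the unregularized optimum at the price of slower convergence.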

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

Continuous Environments

Long-horizon Navigation

Scene Understanding

Autonomous Agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Semantic Scene Graph

Optimal Transport

Graph-Based Reinforcement Learning