🤖 AI Summary
This work addresses the open-set semantic navigation challenge for UAVs in large-scale, unstructured outdoor environments with sparse, long-range targets. Methodologically, we propose a 3D semantic navigation framework featuring a spatially consistent semantic voxel-ray map as persistent memory—integrating short-range voxel search and long-range ray search—and augmenting spatial reasoning with vision-language models (VLMs) to provide cross-modal semantic cues. An adaptive, behavior-tree-driven decision mechanism coordinates reactive responses with global planning. Key contributions include: (i) the first online-updatable 3D semantic memory supporting large-scale open-set navigation; (ii) VLM-enhanced cross-modal spatial reasoning; and (iii) a real-time adaptive behavior-switching strategy. Evaluated across 10 simulated environments and 100 navigation tasks, our method outperforms baselines by 85.25% in success rate and has been successfully deployed and validated in real-world outdoor scenarios.
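The summary's central data structure, a semantic voxel-ray map that fuses short-range detections into voxels and keeps only bearing rays for long-range ones, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the class name `SemanticVoxelRayMap`, the `near_range` threshold, and the max-confidence fusion rule are all assumptions made for the example.

```python
import math
from collections import defaultdict

class SemanticVoxelRayMap:
    """Hypothetical sketch: nearby detections are fused into a voxel grid;
    distant detections, whose depth estimates are unreliable, are stored as
    bearing rays from the observation pose (one plausible reading of the
    paper's short-range voxel search + long-range ray search)."""

    def __init__(self, voxel_size=1.0, near_range=30.0):
        self.voxel_size = voxel_size
        self.near_range = near_range          # assumed range split (meters)
        self.voxels = defaultdict(dict)       # voxel index -> {label: confidence}
        self.rays = []                        # (origin, unit_dir, label, confidence)

    def _index(self, point):
        return tuple(int(math.floor(c / self.voxel_size)) for c in point)

    def insert(self, origin, point, label, conf):
        d = [p - o for p, o in zip(point, origin)]
        dist = math.sqrt(sum(c * c for c in d))
        if dist <= self.near_range:
            # short range: fuse into the grid, keeping the max confidence
            cell = self.voxels[self._index(point)]
            cell[label] = max(cell.get(label, 0.0), conf)
        else:
            # long range: keep only the bearing ray toward the detection
            self.rays.append((origin, tuple(c / dist for c in d), label, conf))

    def query(self, label):
        """Return voxel centers and rays supporting the queried label."""
        hits = [tuple((i + 0.5) * self.voxel_size for i in idx)
                for idx, cell in self.voxels.items() if label in cell]
        return hits, [r for r in self.rays if r[2] == label]
```

Because the map persists across observations, a planner can query it for a target class long after the detection left the camera's field of view, which is what distinguishes this memory from a purely reactive policy.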
📝 Abstract
Aerial outdoor semantic navigation requires robots to explore large, unstructured environments to locate target objects. Recent advances in semantic navigation have demonstrated open-set object-goal navigation in indoor settings, but these methods remain limited by constrained spatial ranges and structured layouts, making them unsuitable for long-range outdoor search. While outdoor semantic navigation approaches exist, they either rely on reactive policies based on current observations, which tend to produce short-sighted behaviors, or precompute scene graphs offline for navigation, limiting adaptability to online deployment. We present RAVEN, a 3D-memory-based behavior-tree framework for aerial semantic navigation in unstructured outdoor environments. It (1) uses a spatially consistent semantic voxel-ray map as persistent memory, enabling long-horizon planning and avoiding purely reactive behaviors, (2) combines short-range voxel search and long-range ray search to scale to large environments, and (3) leverages a large vision-language model to suggest auxiliary cues, mitigating the sparsity of outdoor targets. These components are coordinated by a behavior tree, which adaptively switches behaviors for robust operation. We evaluate RAVEN in 10 photorealistic outdoor simulation environments over 100 semantic tasks, encompassing single-object search, multi-class multi-instance navigation, and sequential task changes. Results show that RAVEN outperforms baselines by 85.25% in simulation, and we demonstrate its real-world applicability through deployment on an aerial robot in outdoor field tests.
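The behavior-tree coordination described above can be illustrated with a minimal priority selector. This is a generic sketch under assumed behaviors, not RAVEN's actual tree: the leaf names (`navigate_to_target`, `follow_vlm_cue`, `explore`) and the blackboard keys are hypothetical, chosen to mirror the fallback order the abstract implies (go to a known target, else follow a VLM cue, else explore).

```python
# Standard behavior-tree tick statuses.
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

class Selector:
    """Ticks children in priority order and returns the status of the first
    child that does not fail; returns FAILURE only if all children fail."""
    def __init__(self, children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child(blackboard)
            if status != FAILURE:
                return status
        return FAILURE

def navigate_to_target(bb):
    # hypothetical leaf: applicable only once the map has localized the target
    if bb.get("target_located"):
        bb["active"] = "navigate"
        return RUNNING
    return FAILURE

def follow_vlm_cue(bb):
    # hypothetical leaf: steer toward a VLM-suggested auxiliary cue
    if bb.get("vlm_cue"):
        bb["active"] = "cue"
        return RUNNING
    return FAILURE

def explore(bb):
    # hypothetical fallback: exploration is always applicable
    bb["active"] = "explore"
    return RUNNING

root = Selector([navigate_to_target, follow_vlm_cue, explore])
```

Re-ticking the tree every control cycle is what makes the switching adaptive: as soon as the blackboard gains a VLM cue or a localized target, higher-priority leaves preempt exploration without any explicit mode-transition logic.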