UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling unmanned aerial vehicles (UAVs) to perform robust vision-and-language navigation (VLN) in unknown indoor and outdoor environments guided by natural language instructions. Methodologically, we propose the first end-to-end VLN framework tailored for aerial platforms, integrating a large language model with a multi-scale visual encoder and introducing an interpretable cross-modal grounding mechanism for fine-grained alignment between linguistic intent and visual scenes. Our approach jointly incorporates semantic object detection, spatial relation reasoning, and end-to-end trajectory planning, requiring only minimal task-specific annotations while generalizing effectively to novel instructions and unseen environments. Experiments demonstrate significant improvements in instruction-following accuracy and trajectory efficiency, with strong generalization across diverse scenarios, safe flight behavior, and intuitive human–UAV interaction.
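As a rough illustration of the pipeline the summary describes, the sketch below wires together an LLM-based instruction parser, an object detector, a cross-modal grounding module, and a trajectory planner. All names and interfaces (`llm`, `detector`, `grounder`, `planner`, `controller`) are hypothetical stand-ins for clarity, not the authors' implementation.

```python
# Illustrative sketch of an LLM + vision VLN control loop for a UAV.
# Component interfaces are assumed for clarity; they do not reflect
# the paper's actual code.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # semantic class, e.g. "red door"
    bbox: tuple         # (x_min, y_min, x_max, y_max) in image coordinates
    confidence: float

def navigate(instruction, get_frame, llm, detector, grounder, planner, controller):
    """Run one instruction-following episode: parse, ground, plan, act."""
    # 1. The LLM parses the free-form instruction into ordered sub-goals,
    #    e.g. ["take off", "fly past the red building", "land near the bench"].
    sub_goals = llm.parse(instruction)

    for goal in sub_goals:
        frame = get_frame()              # latest onboard camera image
        detections = detector(frame)     # semantic object detection
        # 2. Cross-modal grounding: select the detection that best matches
        #    the linguistic target of this sub-goal.
        target = grounder(goal, detections)
        # 3. Plan a feasible trajectory toward the grounded target and fly it.
        waypoints = planner(target, frame)
        controller.follow(waypoints)
```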

📝 Abstract
A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.
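One way the cross-modal grounding step described in the abstract could be realized is by scoring detected objects against an instruction phrase in a shared vision-language embedding space. The function below is a minimal sketch under that assumption; the encoder, embedding shapes, and threshold are illustrative choices, not details taken from the paper.

```python
import numpy as np

def ground_phrase(phrase_embedding: np.ndarray,
                  object_embeddings: list[np.ndarray],
                  threshold: float = 0.3):
    """Return the index of the detection whose embedding best matches the
    instruction phrase, or None if nothing is similar enough.

    Assumes both embeddings live in a shared vision-language space
    (e.g. produced by a CLIP-style encoder); illustrative only."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = [cosine(phrase_embedding, obj) for obj in object_embeddings]
    if not scores or max(scores) < threshold:
        return None          # the referenced object is not visible yet
    return int(np.argmax(scores))
```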
Problem

Research questions and friction points this paper is trying to address.

Enabling UAVs to navigate via natural language commands
Integrating LLMs with vision for human-interactive navigation
Planning context-aware UAV trajectories in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs with visual perception for UAV navigation
Uses cross-modal grounding to align language and vision
Enables free-form natural language instruction interpretation (see the sketch after this list)
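The last point can be made concrete with a small sketch: prompt an LLM to decompose a free-form command into structured sub-goals. The prompt wording, JSON schema, and `llm_complete` callable are assumptions made for illustration, not the paper's interface.

```python
import json

PROMPT_TEMPLATE = """You are a UAV navigation planner.
Break the instruction into an ordered list of sub-goals and return JSON
shaped like [{{"action": "...", "target": "...", "relation": "..."}}].
Instruction: "{instruction}"
"""

def parse_instruction(instruction, llm_complete):
    """Turn a free-form command into a list of structured sub-goal dicts.

    `llm_complete` is any callable that sends a prompt string to an LLM
    and returns its text response (a hypothetical interface)."""
    response = llm_complete(PROMPT_TEMPLATE.format(instruction=instruction))
    return json.loads(response)

# For "fly over the parking lot and land next to the white van", a response
# might decode to:
# [{"action": "fly_over", "target": "parking lot", "relation": None},
#  {"action": "land",     "target": "white van",   "relation": "next_to"}]
```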
Pranav Saxena
Birla Institute of Technology and Science Pilani, K. K. Birla Goa Campus, Goa, India
Nishant Raghuvanshi
Birla Institute of Technology and Science Pilani, K. K. Birla Goa Campus, Goa, India
Neena Goveas
BITS Pilani Goa Campus
Network Science · TinyML · IoT networks