Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language navigation (VLN) methods for unmanned aerial vehicles (UAVs) typically rely on discrete or coarse-grained actions, which limits their ability to execute semantically complex, long-horizon tasks that require smooth, continuous control. To address this, this work introduces FLIGHT, a new benchmark that pairs fine-grained, multi-stage instructions with dense 6-degree-of-freedom (6-DoF) trajectories. The authors further propose FLIGHT VLA, an asynchronous dual-frequency vision-language-action (VLA) architecture: a low-frequency streaming vision-language model reasons about the task and generates explicit pilot-like textual commands, while a high-frequency diffusion-based action model produces continuous control. In closed-loop evaluation on the FLIGHT benchmark, this approach consistently outperforms representative VLN and VLA baselines in multi-stage task completion, subgoal adherence, and terminal control. The trained streaming vision-language model also improves UAV video reasoning.

📝 Abstract

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands. However, existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions, and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To let the UAV agent reason in flight about task execution status and mission planning while still supporting precise, high-frequency, real-time control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control. The architecture is supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.
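The decoupling of a slow reasoning model from a fast control model can be illustrated with a small, deterministic simulation. The sketch below is not the authors' implementation: the names `PilotCommand`, `slow_reasoner`, `fast_policy`, and `run_dual_rate`, along with the tick counts and the latency model, are all hypothetical, and both models are replaced by trivial stubs. It only shows the scheduling idea, in which the fast loop emits an action on every tick using the most recent textual command and never waits for the slow model to finish.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# 6-DoF action as (vx, vy, vz, roll_rate, pitch_rate, yaw_rate); layout is an assumption.
Action = Tuple[float, float, float, float, float, float]


@dataclass
class PilotCommand:
    """Hypothetical container for the slow model's textual output."""
    text: str
    issued_at_tick: int


def slow_reasoner(tick: int) -> PilotCommand:
    # Stub standing in for the low-frequency streaming VLM (assumption).
    return PilotCommand(text=f"tick {tick}: continue to next subgoal",
                        issued_at_tick=tick)


def fast_policy(command: Optional[PilotCommand]) -> Action:
    # Stub standing in for the high-frequency action model (assumption).
    # Hover until a command is available, then fly forward at unit speed.
    if command is None:
        return (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    return (1.0, 0.0, 0.0, 0.0, 0.0, 0.0)


def run_dual_rate(total_ticks: int, slow_period: int, slow_latency: int
                  ) -> List[Tuple[int, Optional[int], Action]]:
    """Simulate a fast control loop that consumes results from a slow reasoner.

    Returns one (tick, issue tick of the command in use, action) entry per tick.
    """
    latest: Optional[PilotCommand] = None                 # command visible to the fast loop
    pending: Optional[Tuple[int, PilotCommand]] = None    # (ready_tick, result)
    log: List[Tuple[int, Optional[int], Action]] = []

    for tick in range(total_ticks):
        # Publish the slow model's result once its simulated latency has elapsed.
        if pending is not None and tick >= pending[0]:
            latest = pending[1]
            pending = None
        # Start a new reasoning request only when idle and on the slow schedule.
        if pending is None and tick % slow_period == 0:
            pending = (tick + slow_latency, slow_reasoner(tick))
        # The fast loop always acts, using whatever command is currently published.
        action = fast_policy(latest)
        log.append((tick, latest.issued_at_tick if latest else None, action))
    return log


if __name__ == "__main__":
    for entry in run_dual_rate(total_ticks=10, slow_period=4, slow_latency=2):
        print(entry)
```

In this toy setting the controller hovers for the first two ticks, then follows each command until a newer one is published. A real system would run the two models in separate processes or threads and condition the action model on visual observations as well as the command text.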

Problem

Research questions and friction points this paper is trying to address.

UAV navigation

long-horizon instruction

fine-grained control

vision-language-action

continuous flight commands

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained Long-horizon Navigation

Vision-Language-Action (VLA)

Asynchronous Architecture