🤖 AI Summary
Existing vision-language navigation (VLN) methods for unmanned aerial vehicles (UAVs) typically rely on discrete or coarse-grained actions, which limits their ability to execute semantically complex, long-horizon tasks that require continuous, smooth control. To address this, the work introduces FLIGHT, a benchmark that pairs fine-grained, multi-stage instructions with dense 6-degree-of-freedom (6-DoF) trajectories. The authors further propose FLIGHT VLA, an asynchronous dual-frequency vision-language-action (VLA) architecture: a low-frequency streaming vision-language model reasons about task state and generates explicit pilot-style reasoning text that summarizes the current flight state and anticipates the next subgoal, while a high-frequency diffusion-based action model produces continuous control. In closed-loop evaluation on FLIGHT, the approach consistently outperforms representative VLN and VLA baselines in multi-stage task completion, sub-goal adherence, and terminal control accuracy. The trained streaming model also improves UAV video reasoning.
📝 Abstract
Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands. Yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions, and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce \textbf{FLIGHT}, a \textbf{F}ine-grained \textbf{L}ong-horizon \textbf{I}nstruction-\textbf{G}uided benchmark for \textbf{H}ybrid UAV navigation and reasoning \textbf{T}asks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To let the UAV agent reason in flight, in real time, about task execution status and mission planning while still supporting precise high-frequency control, we further propose \textbf{FLIGHT VLA}, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control. Training is supervised by explicit \textbf{Pilot Reasoning} texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on the FLIGHT benchmark, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.
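The dual-frequency decoupling described above can be illustrated with a toy control loop. This is a minimal sketch and not the authors' implementation: the class `DualRateLoop`, the stub functions, the 20 Hz and 2 Hz rates, and the 6-tuple action layout are all hypothetical. The two rates are also simulated in a single deterministic loop rather than run concurrently as a truly asynchronous system would. The only idea it shows is that a slow reasoning module refreshes a text summary occasionally, while a fast policy emits an action on every tick conditioned on the most recent summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical 6-DoF command layout: (vx, vy, vz, roll_rate, pitch_rate, yaw_rate).
Action = Tuple[float, float, float, float, float, float]


@dataclass
class DualRateLoop:
    """Toy dual-rate loop: slow text reasoning, fast continuous control."""

    slow_planner: Callable[[int], str]          # stand-in for the streaming VLM
    fast_policy: Callable[[int, str], Action]   # stand-in for the diffusion action model
    control_hz: int = 20    # assumed control rate, not taken from the paper
    reasoning_hz: int = 2   # assumed reasoning rate, not taken from the paper

    def run(self, n_ticks: int) -> List[Tuple[int, str, Action]]:
        # Number of control ticks between two reasoning updates.
        period = max(1, self.control_hz // self.reasoning_hz)
        latest_text = ""
        log: List[Tuple[int, str, Action]] = []
        for tick in range(n_ticks):
            if tick % period == 0:
                # Slow path: refresh the reasoning text only occasionally.
                latest_text = self.slow_planner(tick)
            # Fast path: act on every tick using the latest available text.
            action = self.fast_policy(tick, latest_text)
            log.append((tick, latest_text, action))
        return log


def stub_planner(tick: int) -> str:
    """Placeholder reasoning text; a real system would query a VLM here."""
    return f"stage {tick // 10}: continue toward next subgoal"


def stub_policy(tick: int, text: str) -> Action:
    """Placeholder policy: fly forward whenever any reasoning text is available."""
    forward = 1.0 if text else 0.0
    return (forward, 0.0, 0.0, 0.0, 0.0, 0.0)


if __name__ == "__main__":
    for tick, text, action in DualRateLoop(stub_planner, stub_policy).run(20):
        print(tick, text, action)
```

With the assumed rates, the reasoning text changes once every ten control ticks, so the fast policy reuses the same text for ten consecutive actions. A concurrent version would replace the modulo check with a separate worker that overwrites the shared text whenever the slow model finishes.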