🤖 AI Summary
This work addresses the challenge of translating natural language instructions into dynamically feasible six-degree-of-freedom (6-DoF) flight trajectories for unmanned aerial vehicles (UAVs) in partially observable environments. The authors propose an imagination-driven framework that first uses a latent video diffusion model to generate future visual observations conditioned on the given instruction, then extracts 6-DoF motion from these imagined frames and refines it into a collision-free trajectory with a kinodynamic planner. By combining semantic understanding with geometric and dynamical consistency, the method outperforms existing vision-language navigation (VLN) and vision-language-action (VLA) approaches on benchmarks and in real-world flight tests while using only 1.3 billion parameters, which the authors present as evidence of its efficiency and practical applicability.
📝 Abstract
Vision-language navigation (VLN) for UAVs demands grounding free-form instructions into 6-DoF flight under partial observability. While Vision-Language-Action (VLA) models excel at semantic reasoning, they suffer from brittleness due to geometric inconsistency and dynamics mismatch. To address this, we propose ImagineUAV, an imagination-driven framework leveraging cascaded world-action modeling. Instead of direct regression, ImagineUAV employs a latent video diffusion model to generate instruction-conditioned future observations, explicitly imagining environmental evolution, from which 6-DoF motions are inferred via an action extractor. A kinodynamic planner then refines these estimates into collision-free trajectories. Additionally, a step-distilled inference pipeline ensures real-time execution. With only 1.3B parameters, ImagineUAV outperforms prior VLN and VLA baselines on benchmarks and real-world flights, validating the practicality of imagination-driven aerial navigation.
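The cascade described in the abstract (imagine future frames, extract 6-DoF motion, refine with a planner) can be pictured as three composed stages. The sketch below is only an illustration of that data flow, not the authors' implementation: the names `CascadedNavigator`, `imagine`, `extract`, `refine`, and `limit_translation` are hypothetical, the three stages are passed in as plain callables, and the refinement step is a toy translation limiter standing in for a real kinodynamic planner.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical 6-DoF motion increment: (dx, dy, dz, droll, dpitch, dyaw).
Motion = Tuple[float, float, float, float, float, float]


def limit_translation(motions: List[Motion], max_step: float) -> List[Motion]:
    """Toy stand-in for stage 3: rescale each translation to at most max_step.

    A real kinodynamic planner would also enforce collision avoidance and
    vehicle dynamics; this only caps the per-step translation norm.
    """
    limited = []
    for dx, dy, dz, droll, dpitch, dyaw in motions:
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        scale = 1.0 if norm <= max_step else max_step / norm
        limited.append((dx * scale, dy * scale, dz * scale, droll, dpitch, dyaw))
    return limited


@dataclass
class CascadedNavigator:
    # Stage 1 (assumed interface): observation + instruction -> imagined frames.
    imagine: Callable[[Any, str], List[Any]]
    # Stage 2 (assumed interface): imagined frames -> raw 6-DoF motions.
    extract: Callable[[List[Any]], List[Motion]]
    # Stage 3 (assumed interface): raw motions -> refined motions.
    refine: Callable[[List[Motion]], List[Motion]]

    def plan(self, observation: Any, instruction: str) -> List[Motion]:
        frames = self.imagine(observation, instruction)
        motions = self.extract(frames)
        return self.refine(motions)


if __name__ == "__main__":
    # Stub stages so the data flow can be run end to end.
    nav = CascadedNavigator(
        imagine=lambda obs, text: [obs, obs],
        extract=lambda frames: [(3.0, 4.0, 0.0, 0.0, 0.0, 0.1) for _ in frames],
        refine=lambda motions: limit_translation(motions, max_step=1.0),
    )
    for step in nav.plan("frame0", "fly forward past the tree"):
        print(step)
```

Keeping the stages as separate callables mirrors the abstract's point that actions are inferred from imagined observations rather than regressed directly, so each stage can be swapped or tested in isolation.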