🤖 AI Summary
This work addresses the challenges of indoor drone navigation that arise from single-view observations: weak occlusion reasoning, difficulty in assessing target visibility, and limited understanding of the global scene. The authors propose AgenticDiffusion, a navigation framework that combines language instruction grounding, open-vocabulary object localization, viewpoint selection between first-person and top views, and diffusion-based motion planning, with nonlinear model predictive control (NMPC) handling trajectory execution. Given a language instruction, the system selects the most informative viewpoint, builds a mission plan, and hands the localized targets to viewpoint-specific diffusion planners, which supports multi-stage missions, long-horizon navigation, and safe landing-site selection. In 40 real-world flight trials the system achieves an 80% mission success rate, and the diffusion planners achieve a 100% trajectory generation success rate. By using complementary viewpoints, the framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments.
📝 Abstract
Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, which limits their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and nonlinear model predictive control (NMPC) within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. Targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. By using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experiments yielded an overall mission success rate of 80% across 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.
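The order of operations described in the abstract (mission planning with viewpoint choice, target grounding, viewpoint-specific planning, then tracking) can be sketched as a small orchestration loop. This is only an illustrative sketch and not the authors' implementation: every name here (`Observation`, `Stage`, `run_mission`, and the four callables) is hypothetical, images are opaque placeholders, and the grounding model, diffusion planners, and NMPC tracker are passed in as plain functions rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Waypoint = Tuple[float, float, float]   # (x, y, z) position, assumed format
Pixel = Tuple[float, float]             # grounded target location in the image


@dataclass
class Observation:
    """Synchronized pair of views; image types are placeholders."""
    fpv: object   # first-person-view image
    top: object   # top-view image


@dataclass
class Stage:
    """One step of a hypothetical mission plan."""
    target: str   # open-vocabulary target phrase
    view: str     # "fpv" or "top": the viewpoint chosen for this stage


def run_mission(
    instruction: str,
    observe: Callable[[], Observation],
    plan_mission: Callable[[str, Observation], List[Stage]],
    ground_target: Callable[[str, object], Pixel],
    planners: Dict[str, Callable[[object, Pixel], List[Waypoint]]],
    track: Callable[[List[Waypoint]], bool],
) -> bool:
    """Plan first, then execute each stage; stop at the first failure."""
    # The mission plan (targets and viewpoints) is produced before any flight.
    stages = plan_mission(instruction, observe())
    for stage in stages:
        obs = observe()  # refresh both views before each stage
        image = obs.fpv if stage.view == "fpv" else obs.top
        pixel = ground_target(stage.target, image)
        # Each viewpoint has its own planner, mirroring the
        # "viewpoint-specific diffusion planners" in the abstract.
        trajectory = planners[stage.view](image, pixel)
        if not trajectory or not track(trajectory):
            return False
    return True
```

Passing the components in as callables keeps the sketch independent of any particular grounding model, planner, or controller; a stage fails if its planner returns no trajectory or the tracker reports failure.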