AI Summary
This paper addresses the fundamental challenge in unmanned aerial vehicle (UAV) visual servoing (VS), namely the inability to navigate when the target is initially invisible. To overcome this limitation, we propose a novel VS framework grounded in latent-space diffusion modeling. Our core innovation is the first integration of a return-conditioned latent-space denoising diffusion probabilistic model (DDPM) into VS, coupled with a cross-modal variational autoencoder to disentangle visual and motor representations. By explicitly modeling and generating optimal control trajectories conditioned on partial or absent visual observations, our method enables reliable navigation even under non-line-of-sight (NLOS) conditions. Unlike conventional VS approaches, it does not require continuous target visibility. Extensive simulations demonstrate stable, high-convergence-rate navigation across both quadrotor and hexarotor platforms, with strong robustness to occlusion and initialization uncertainty. This work establishes a new paradigm for NLOS visual servoing.
Abstract
In this paper, we present a novel visual servoing (VS) approach based on latent Denoising Diffusion Probabilistic Models (DDPMs) that explores the application of generative models to vision-based navigation of UAVs (Uncrewed Aerial Vehicles). Unlike classical VS methods, the proposed approach can reach the desired target view even when the target is initially not visible. This is made possible by learning a latent representation that the DDPM uses for planning, together with a dataset of trajectories that includes initial views in which the target is not visible. A compact representation is learned from raw images using a Cross-Modal Variational Autoencoder. Given the current image, the DDPM generates trajectories in the latent space that drive the robotic platform to the desired visual target. The approach has been validated in simulation using two generic multi-rotor UAVs (a quadrotor and a hexarotor). The results show that the visual target can be reached successfully, even when it is not visible in the initial view.
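To make the latent-space planning step concrete, the following is a minimal sketch of standard DDPM ancestral sampling applied to a short trajectory of latent vectors. It is not the authors' implementation. The function names, the linear noise schedule, the horizon, the latent dimension, and the use of a goal latent `z_goal` are all assumptions introduced for illustration. In particular, `toy_noise_predictor` is a hypothetical stand-in for the trained denoising network: it assumes the clean trajectory is a straight line between the current and goal latents and returns the exact noise implied by that assumption. The encoding of images into latents by the Cross-Modal Variational Autoencoder is omitted.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(traj, t, z_current, z_goal, alpha_bars):
    """Hypothetical stand-in for a learned denoiser.

    Assumes the clean trajectory x0 is a straight line in latent space
    from z_current to z_goal, then inverts the forward relation
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps to recover eps.
    """
    horizon = traj.shape[0]
    w = np.linspace(0.0, 1.0, horizon)[:, None]
    x0 = (1.0 - w) * z_current + w * z_goal
    return (traj - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])


def sample_latent_trajectory(z_current, z_goal, horizon=8, num_steps=50, seed=0):
    """DDPM ancestral sampling of a (horizon, latent_dim) trajectory."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise in latent space.
    traj = rng.standard_normal((horizon, z_current.shape[0]))
    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(traj, t, z_current, z_goal, alpha_bars)
        # Posterior mean of the reverse step given the predicted noise.
        mean = (traj - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Inject noise on every step except the last one.
            traj = mean + np.sqrt(betas[t]) * rng.standard_normal(traj.shape)
        else:
            traj = mean
    return traj


if __name__ == "__main__":
    z_now = np.zeros(4)   # assumed latent of the current image
    z_tgt = np.ones(4)    # assumed latent of the desired view
    plan = sample_latent_trajectory(z_now, z_tgt)
    print(plan.shape)
    print(plan[0], plan[-1])
```

With a trained network in place of the toy predictor, the same loop would produce trajectories shaped by the training data rather than by a straight-line assumption, which is what would allow plans that start from views where the target is not visible.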