🤖 AI Summary
Monocular visual landing of shipborne UAVs demands accurate, robust 6D relative pose estimation under severe challenges, including the scarcity of real-world training data and complex, dynamic maritime illumination.
Method: This paper proposes a deep Transformer-based framework for 6D relative pose estimation. To address data scarcity and illumination variability, the method trains on synthetically generated ship images, detects 2D keypoints on multiple ship parts, and fuses the resulting per-part pose estimates with a Bayesian strategy to improve robustness against occlusion and noise. Notably, it is presented as the first work to adapt the Transformer architecture to monocular ship–UAV 6D pose estimation, enabling end-to-end joint optimization of keypoint localization and geometric constraints.
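The summary names Bayesian fusion without spelling it out, and the paper's exact formulation is not given here. As a minimal sketch, assuming each ship part yields an independent Gaussian estimate of the relative position (rotation fusion is analogous but lives on SO(3)), the fused posterior is the inverse-covariance-weighted combination; the helper name `fuse_gaussian_positions` is hypothetical:

```python
import numpy as np

def fuse_gaussian_positions(means, covs):
    """Fuse independent Gaussian position estimates (one per detected ship
    part) into a single posterior via inverse-covariance weighting.

    means : list of (3,) arrays  -- per-part relative position estimates (m)
    covs  : list of (3, 3) arrays -- per-part estimate covariances
    """
    info = np.zeros((3, 3))   # accumulated information matrix (sum of inverse covariances)
    info_vec = np.zeros(3)    # accumulated information vector
    for mu, cov in zip(means, covs):
        w = np.linalg.inv(cov)
        info += w
        info_vec += w @ mu
    fused_cov = np.linalg.inv(info)
    fused_mean = fused_cov @ info_vec
    return fused_mean, fused_cov

# Three parts roughly agree; the noisy third estimate is automatically down-weighted.
means = [np.array([10.1, 2.0, 35.2]),
         np.array([9.9, 2.1, 34.8]),
         np.array([10.8, 1.4, 36.5])]
covs = [0.04 * np.eye(3), 0.05 * np.eye(3), 0.6 * np.eye(3)]
mean, cov = fuse_gaussian_positions(means, covs)
print(mean)  # close to the two confident estimates
```

Under this reading, an occluded or poorly detected part simply contributes a large covariance and is softly ignored, which is consistent with the robustness claim above.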
Results: Evaluated on synthetic benchmarks and real flight experiments, the method achieves positional errors of approximately 0.8% and 1.0% of the ship–UAV distance, respectively (roughly 0.8 m and 1.0 m at a 100 m standoff), demonstrating high accuracy, generalization to unseen scenarios, and practical applicability to ship-based autonomous UAV landing and navigation.
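The abstract does not define the metric beyond "percentage of the distance to the ship"; one plausible reading, shown here as an assumption, normalizes the translation error by the ground-truth range:

```python
import numpy as np

def relative_position_error_pct(t_est, t_gt):
    """Translation error as a percentage of the UAV-ship distance.

    t_est, t_gt : (3,) estimated / ground-truth relative translations (m)
    """
    return 100.0 * np.linalg.norm(t_est - t_gt) / np.linalg.norm(t_gt)

# A 0.8 m error at a 100 m standoff is a 0.8% relative error.
print(relative_position_error_pct(np.array([0.0, 0.8, 100.0]),
                                  np.array([0.0, 0.0, 100.0])))  # ~0.8
```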
📝 Abstract
This paper introduces a deep Transformer network for estimating the relative 6D pose of an Unmanned Aerial Vehicle (UAV) with respect to a ship from monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer neural network is trained to detect these keypoints and estimate the 6D pose of each part, and the per-part estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy under various lighting conditions. The position estimation error is approximately 0.8% and 1.0% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications in ship-based autonomous UAV landing and navigation.
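The abstract leaves open whether the network regresses each part's pose directly or whether a geometric solver sits between keypoints and pose. A classical baseline for the keypoints-to-pose step, offered here only as an illustrative assumption rather than the paper's method, is a Perspective-n-Point (PnP) solve per part against that part's known 3D keypoint model, e.g. with OpenCV:

```python
import numpy as np
import cv2

def part_pose_from_keypoints(model_pts_3d, image_pts_2d, K):
    """Recover one ship part's 6D pose (camera frame) from its detected
    2D keypoints and known 3D keypoint model via PnP (needs >= 4 points).

    model_pts_3d : (N, 3) keypoint coordinates in the part's frame (m)
    image_pts_2d : (N, 2) detected pixel coordinates
    K            : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec = cv2.solvePnP(
        model_pts_3d.astype(np.float64),
        image_pts_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None)
    if not ok:
        raise RuntimeError("PnP failed for this part")
    R, _ = cv2.Rodrigues(rvec)    # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3)     # part pose in the camera frame
```

Each part's (R, t) could then feed a fusion step such as the one sketched after the method summary above.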