🤖 AI Summary
Problem: Existing approaches to aerial-ground collaborative robot navigation in unstructured environments rely on scene-specific fine-tuning and couple multimodal perception only weakly with localization.
Method: This paper proposes the first monocular-camera-driven, unified aerial-ground object-centric mapping framework. It integrates monocular visual SLAM, object-level semantic segmentation, and 3D pose estimation, which enables zero-shot generalization to diverse object categories and recovers their metric 3D positions without per-scene adaptation. The resulting digital-twin semantic map is platform-agnostic and can be shared across heterogeneous robots. A dedicated aerial-ground cooperative navigation control strategy is further introduced to support autonomous localization and target search in dynamic environments.
Results: In simulated search-and-rescue missions, the MorphoGear UAV detects and tracks a quadrupedal robot in real time (<30 ms/frame), indicating that the system remains robust under complex, unstructured conditions.
📝 Abstract
This paper presents a novel mapping approach for a universal aerial-ground robotic system utilizing a single monocular camera. The proposed system is capable of detecting a diverse range of objects and estimating their positions without requiring fine-tuning for specific environments. The system's performance was evaluated through a simulated search-and-rescue scenario, where the MorphoGear robot successfully located a robotic dog while an operator monitored the process. This work contributes to the development of intelligent, multimodal robotic systems capable of operating in unstructured environments.
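To make the position-estimation step concrete, the following is a minimal sketch of how a detected object's pixel location could be lifted to a 3D position in a shared map frame. It is not the paper's implementation. It assumes a pinhole camera model, a metric depth value for the object (the source does not say how metric scale is obtained), and a camera-to-world pose of the kind a SLAM system would supply. All function names, parameters, and numeric values are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) at a given metric depth to a camera-frame 3D point
    using the pinhole model (assumed here, not stated in the source)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    R_wc: Sequence[Sequence[float]],
                    t_wc: Sequence[float]) -> Vec3:
    """Transform a camera-frame point into the world (map) frame:
    p_world = R_wc * p_cam + t_wc, with R_wc a 3x3 rotation matrix."""
    return tuple(
        sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
        for i in range(3)
    )


def object_world_position(centroid_px: Tuple[float, float], depth: float,
                          intrinsics: Tuple[float, float, float, float],
                          R_wc: Sequence[Sequence[float]],
                          t_wc: Sequence[float]) -> Vec3:
    """Combine both steps for one detected object (hypothetical helper).
    centroid_px is the centre of the object's segmentation mask."""
    fx, fy, cx, cy = intrinsics
    p_cam = backproject_pixel(centroid_px[0], centroid_px[1], depth,
                              fx, fy, cx, cy)
    return camera_to_world(p_cam, R_wc, t_wc)


if __name__ == "__main__":
    # Illustrative values only: a 640x480 camera with 500 px focal length,
    # an object centred 100 px right of the principal point, 2 m away.
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    position = object_world_position(
        centroid_px=(420.0, 240.0), depth=2.0,
        intrinsics=(500.0, 500.0, 320.0, 240.0),
        R_wc=identity, t_wc=[1.0, 0.0, 0.5],
    )
    print(position)
```

Because the output is expressed in the map frame rather than in any one robot's camera frame, a position computed this way could be consumed by either an aerial or a ground platform, which is the property the summary describes as platform-agnostic.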