🤖 AI Summary
This work addresses agile quadrotor flight in cluttered environments, where the vehicle must cope with partial observability and perception latency while keeping track of environmental geometry and visibility history. The authors propose Mapping-Aware Dreamer (MAD), a world model that uses robocentric occupancy and visibility grid maps, together with proprioceptive states, as self-supervised reconstruction targets in place of conventional image reconstruction. This design pushes the latent state to encode local geometry, visibility history, and ego-motion. MAD combines recurrent latent dynamics with GPU-parallel map construction and supports three policy-learning modes: imagination-based MAD-Dreamer and feature-extractor variants based on PPO and SHAC. In visual navigation and drone racing tasks, MAD-based agents outperform the corresponding vision-only baselines, with higher success rates, faster flight (9.66 m/s in simulation and 5.05 m/s in real-world forest experiments), and better cross-task transfer. The learned policy is also deployed safely in indoor and outdoor settings under limited sensing.
📝 Abstract
Agile quadrotor flight in cluttered scenes requires more than a reactive mapping from a depth image to a control command: the vehicle must remember which regions have been observed, infer nearby occupied space, and act under partial visibility and tight latency. In this paper, we present Mapping-Aware Dreamer (MAD), a geometry-aware world model for vision-based quadrotor flight. Instead of using raw-image reconstruction as the main self-supervised objective, MAD learns recurrent latent dynamics that reconstruct robocentric occupancy and visibility grid maps together with proprioceptive states. This design forces the latent state to encode local geometry, visibility history, and ego-motion in a form that is directly relevant to collision avoidance. MAD is trained in DiffAero using a GPU-parallel map-construction module that provides high-throughput supervision for occupancy and visibility. The learned representation is used in three policy-learning modes: imagination-based MAD-Dreamer and feature-extractor variants based on PPO and SHAC. Across visual navigation and racing tasks, MAD-based agents achieve higher success rates, faster flight, and better cross-task transfer than corresponding vision-only baselines. The model also produces interpretable map predictions and accurate ego-motion estimates from depth observations. We further deploy the learned policy on a physical quadrotor with an Intel RealSense D435i and demonstrate safe indoor and outdoor flight under limited sensing, reaching 9.66 m/s in simulation and 5.05 m/s in real-world forest experiments. These results show that mapping-aware world models provide a practical middle ground between modular aerial navigation and end-to-end learning.
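To make the supervision signal concrete, the following is a minimal sketch of how robocentric occupancy and visibility labels could be derived from depth rays, and how a predicted map could be scored against them. It is not the authors' GPU-parallel module: the function names `robocentric_maps` and `bce`, the 2D planar-scan simplification, the grid size, the cell size, the range limit, and the choice of binary cross-entropy as the reconstruction loss are all assumptions made for illustration.

```python
import math


def robocentric_maps(depths, angles, grid_size=21, cell=0.5, max_range=5.0):
    """Build 2D occupancy and visibility grids centred on the vehicle.

    depths: range per ray in metres (a value >= max_range means no hit).
    angles: ray bearing per ray in radians, in the body frame.
    Returns (occ, vis) as grid_size x grid_size lists of 0/1.
    All parameters are hypothetical defaults, not values from the paper.
    """
    occ = [[0] * grid_size for _ in range(grid_size)]
    vis = [[0] * grid_size for _ in range(grid_size)]
    half = grid_size // 2
    step = cell / 2.0  # march at half-cell spacing to limit skipped cells

    def to_index(x, y):
        # Nearest-cell index, with the vehicle at the centre cell.
        return (math.floor(x / cell + 0.5) + half,
                math.floor(y / cell + 0.5) + half)

    def in_grid(i, j):
        return 0 <= i < grid_size and 0 <= j < grid_size

    for d, a in zip(depths, angles):
        reach = min(d, max_range)
        # Cells crossed before the hit have been observed (visible, free).
        for k in range(int(reach / step)):
            r = k * step
            i, j = to_index(r * math.cos(a), r * math.sin(a))
            if in_grid(i, j):
                vis[i][j] = 1
        # A return inside the sensing range marks an occupied, visible cell.
        if d < max_range:
            i, j = to_index(d * math.cos(a), d * math.sin(a))
            if in_grid(i, j):
                occ[i][j] = 1
                vis[i][j] = 1
    return occ, vis


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted and a target grid."""
    total, n = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            n += 1
    return total / n


# One ray straight ahead that hits an obstacle 2 m away.
occ, vis = robocentric_maps([2.0], [0.0])
# Score an uninformative prediction (0.5 everywhere) against both targets.
flat = [[0.5] * 21 for _ in range(21)]
loss = bce(flat, occ) + bce(flat, vis)
```

Cells behind the obstacle stay unmarked in the visibility grid, which is the property that lets a recurrent model learn which regions have and have not been observed. A 3D grid and a fused multi-frame history, as the abstract implies, would follow the same pattern with an extra axis and ego-motion compensation.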