SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary autonomous UAV navigation faces several challenges: scarce demonstration data, stringent real-time control requirements, and unreliable external pose estimation. This paper proposes SINGER, a framework integrating a high-fidelity language-embedded simulator, RRT-inspired multi-trajectory expert data generation, and a lightweight end-to-end visuomotor policy network. To the authors' knowledge, SINGER is the first approach to achieve zero-shot, cross-domain, open-vocabulary language navigation without external localization. Leveraging Gaussian Splatting, the authors construct low-sim-to-real-gap environments to enhance data realism, and the trained policy supports onboard real-time closed-loop control. Real-world flight experiments demonstrate significant improvements over a semantic-guided baseline: a 23.33% higher task completion rate, 16.67% better target field-of-view retention, and a 10% lower collision rate.

📝 Abstract
Large vision-language models have driven remarkable progress in open-vocabulary robot policies, e.g., generalist robot manipulation policies, that enable robots to complete complex tasks specified in natural language. Despite these successes, open-vocabulary autonomous drone navigation remains an unsolved challenge due to the scarcity of large-scale demonstrations, the real-time control demands of drones for stabilization, and the lack of reliable external pose estimation modules. In this work, we present SINGER for language-guided autonomous drone navigation in the open world using only onboard sensing and compute. To train robust, open-vocabulary navigation policies, SINGER leverages three central components: (i) a photorealistic language-embedded flight simulator with minimal sim-to-real gap, using Gaussian Splatting for efficient data generation; (ii) an RRT-inspired multi-trajectory expert that generates collision-free navigation demonstrations, which are used to train (iii) a lightweight end-to-end visuomotor policy for real-time closed-loop control. Through extensive hardware flight experiments, we demonstrate superior zero-shot sim-to-real transfer of our policy to unseen environments and unseen language-conditioned goal objects. When trained on ~700k–1M observation-action pairs of language-conditioned visuomotor data and deployed on hardware, SINGER outperforms a velocity-controlled semantic guidance baseline by reaching the query 23.33% more on average, and maintains the query in the field of view 16.67% more on average, with 10% fewer collisions.
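The abstract's component (iii) is a lightweight visuomotor policy that maps onboard observations and a language goal to low-level commands. As an illustration only (not the authors' architecture, whose details are not given here), the sketch below shows the general shape of such a policy: a visual feature vector and a language embedding are fused and passed through a small MLP that emits a bounded 4-D velocity command. All dimensions, weights, and the command parameterization are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a visual feature vector, a language embedding,
# and a 4-D command (vx, vy, vz, yaw-rate).
VIS_DIM, LANG_DIM, HID, CMD_DIM = 64, 32, 128, 4

# Randomly initialized weights stand in for trained parameters.
W1 = rng.standard_normal((VIS_DIM + LANG_DIM, HID)) * 0.05
b1 = np.zeros(HID)
W2 = rng.standard_normal((HID, CMD_DIM)) * 0.05
b2 = np.zeros(CMD_DIM)

def policy(vis_feat, lang_emb, v_max=1.0):
    """Fuse vision + language features and emit a bounded velocity command."""
    x = np.concatenate([vis_feat, lang_emb])
    h = np.tanh(x @ W1 + b1)              # small MLP trunk
    return v_max * np.tanh(h @ W2 + b2)   # squash commands into [-v_max, v_max]

cmd = policy(rng.standard_normal(VIS_DIM), rng.standard_normal(LANG_DIM))
```

Bounding the output with `tanh` is one common way to keep commands within actuator limits; the real system would run this loop at control rate on the onboard computer.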
Problem

Research questions and friction points this paper is trying to address.

Developing autonomous drone navigation using natural language commands
Solving open-vocabulary navigation without external pose estimation
Addressing real-time control demands for drone stabilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Splatting photorealistic simulator for data generation
RRT-inspired multi-trajectory expert for collision-free navigation
Lightweight end-to-end visuomotor policy for real-time control
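The expert data generator above is described only as "RRT-inspired." For intuition, here is a minimal, generic 2D RRT sketch (not the paper's multi-trajectory expert): grow a tree from the start by steering toward random samples, reject segments that enter circular obstacles, and trace parents back once a node reaches the goal. The workspace bounds, step size, and goal bias are illustrative assumptions.

```python
import math
import random

def rrt_path(start, goal, obstacles, step=0.5, max_iters=2000, goal_tol=0.5, seed=0):
    """Generic 2D RRT: returns a list of waypoints from start to goal, or None.

    obstacles: iterable of (cx, cy, radius) circles in a 10x10 workspace.
    """
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}

    def collides(p):
        return any(math.dist(p, (cx, cy)) <= r for cx, cy, r in obstacles)

    for _ in range(max_iters):
        # Bias 10% of samples toward the goal to speed up convergence.
        sample = goal if rng.random() < 0.1 else (rng.uniform(0, 10), rng.uniform(0, 10))
        # Find the nearest existing tree node.
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        near = nodes[i]
        d = math.dist(near, sample)
        if d == 0:
            continue
        # Steer one fixed step from the nearest node toward the sample.
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if collides(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) <= goal_tol:
            # Walk parent pointers back to the root to recover the path.
            path, j = [], len(nodes) - 1
            while j is not None:
                path.append(nodes[j])
                j = parent[j]
            return path[::-1]
    return None
```

The paper's expert presumably generates multiple such trajectories per scene to diversify the demonstration data; running this sketch with different seeds gives the same effect in miniature.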
Maximilian Adang
PhD Candidate, Aeronautics & Astronautics @ Stanford University
Robotics · Perception · Dynamics and Control
JunEn Low
Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA
Ola Shorinwa
Department of Aeronautics and Astronautics, Stanford University, Stanford, CA 94404, USA
Mac Schwager
Stanford University
Robotics · Control · Multi-Agent Systems · Machine Learning · Statistical Inference and Estimation