🤖 AI Summary
This work addresses two problems in large-model-driven drone control: the latency and hallucination risk of keeping the model inside the control loop, and the reliance on opaque, task-specific end-to-end policies. The authors propose a decoupled planner-executor architecture for PX4-based drones in which a large language model (LLM) performs single-pass task planning and execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. A world model is built by combining modular 2D detectors (such as YOLO or vision-language models) with pinhole depth projection for 3D object localization, while a constraint enforcement layer applies altitude limits and horizontal geofencing. The system supports bounded replanning after execution-time action failures. In PX4 software-in-the-loop simulations in Gazebo, the authors report improved explainability, constraint enforcement, and fewer LLM calls than tightly coupled LLM control.
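The interaction between the constraint layer and bounded replanning can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Limits`, `is_allowed`, and `run_mission` are hypothetical, the geofence is assumed to be a circle around the home position, and it is assumed that a violating waypoint is rejected (rather than clamped) and triggers a replan in the same way as an execution failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# (x, y, altitude) in a local frame, metres. Frame choice is an assumption.
Waypoint = Tuple[float, float, float]


@dataclass(frozen=True)
class Limits:
    min_alt: float
    max_alt: float
    fence_radius: float  # assumed circular horizontal geofence around home


def is_allowed(wp: Waypoint, limits: Limits) -> bool:
    """Check one waypoint against the altitude band and the geofence."""
    x, y, alt = wp
    inside_fence = x * x + y * y <= limits.fence_radius ** 2
    return inside_fence and limits.min_alt <= alt <= limits.max_alt


def run_mission(
    plan_fn: Callable[[Optional[str]], List[Waypoint]],
    execute_fn: Callable[[Waypoint], bool],
    limits: Limits,
    max_replans: int = 2,
) -> Tuple[bool, int]:
    """Plan once, execute step by step, replan at most max_replans times.

    plan_fn stands in for the LLM planner and receives feedback from the
    previous attempt (None on the first call). Returns (success, planner_calls).
    """
    feedback: Optional[str] = None
    for attempt in range(max_replans + 1):
        plan = plan_fn(feedback)
        completed = True
        for wp in plan:
            if not is_allowed(wp, limits):
                feedback = f"rejected {wp}: violates altitude or geofence limits"
                completed = False
                break
            if not execute_fn(wp):
                feedback = f"execution failed at {wp}"
                completed = False
                break
        if completed:
            return True, attempt + 1
    return False, max_replans + 1
```

In this sketch the planner is called once per attempt rather than once per control step, which is the property the decoupled design relies on to keep the number of LLM calls low.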
📝 Abstract
Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain and constrain and that require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer applies altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: https://github.com/erdemuysalx/PEACE
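The pinhole depth projection step mentioned in the abstract can be sketched with the standard pinhole camera model, which maps a pixel (u, v) with measured depth Z to camera-frame coordinates X = (u - cx) Z / fx and Y = (v - cy) Z / fy. The Python below is an illustrative sketch only: the names `Intrinsics`, `bbox_center`, and `back_project` are hypothetical, and it is assumed that the detector returns an axis-aligned bounding box and that a depth value in metres is available at its center.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point x, pixels
    cy: float  # principal point y, pixels


def bbox_center(x_min: float, y_min: float, x_max: float, y_max: float) -> Tuple[float, float]:
    """Pixel center of a 2D detection box."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def back_project(u: float, v: float, depth: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Back-project a pixel with known depth into the camera frame (metres)."""
    if depth <= 0.0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)
```

For example, with `Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)`, a detection centered on the principal point at 5 m depth maps to a point on the optical axis. Transforming the result from the camera frame into the vehicle or world frame requires the camera extrinsics and vehicle pose, which this sketch leaves out.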