AI Summary
Existing vision-language-action (VLA) models perform well on single-drone tasks, yet their closed-loop collaborative capabilities in air-ground teaming remain unclear. This work proposes CARLA-Air, the first single-process, state-synchronized simulation framework for air-ground collaboration built on Unreal Engine, and uses it to evaluate VLA models on two challenging tasks: moving-platform landing and occlusion-recovery escort. The experiments show that current models lack three core competencies: grounded understanding of the partner's state, low-latency action coordination, and alignment with shared team objectives. The study also finds that state prompting yields limited benefit and that naive bidirectional interaction often amplifies errors, which highlights key challenges and suggests directions for designing collaborative VLA systems.
Abstract
Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world.
We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency.
Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.
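As a rough illustration of the two measurements named above (simulation-timestamp alignment and effective coordination latency), the following minimal sketch shows one way such quantities might be computed from logged simulation timestamps. It is not taken from the CARLA-Air codebase: the function names `timestamp_skew` and `coordination_latency`, the log layout, and the definition of latency as "time from a partner state change to the first subsequent response" are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def timestamp_skew(uav_stamps: Sequence[float], ugv_stamps: Sequence[float]) -> float:
    """Largest absolute per-tick difference between the two agents' timestamps.

    Hypothetical metric: in a shared-tick simulator the paired stamps should
    coincide, so a value of 0.0 means perfectly aligned logs.
    Returns 0.0 when there are no paired ticks.
    """
    return max((abs(a - b) for a, b in zip(uav_stamps, ugv_stamps)), default=0.0)


def coordination_latency(
    partner_change_time: float, response_times: Sequence[float]
) -> Optional[float]:
    """Delay (simulation seconds) from a partner state change to the first
    response issued at or after it; None if no response follows.

    Hypothetical definition, used here only to make the idea concrete.
    """
    later = [t for t in response_times if t >= partner_change_time]
    return min(later) - partner_change_time if later else None


# Example with made-up timestamps (seconds of simulation time).
aligned = timestamp_skew([0.0, 0.05, 0.10], [0.0, 0.05, 0.10])
latency = coordination_latency(1.0, [0.5, 1.25, 2.0])
print(aligned, latency)
```

Working in simulation time rather than wall-clock time keeps both quantities independent of how fast the host machine runs the simulator, which is the main reason a shared physics tick makes them well defined.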