AI Summary
This work addresses representation coupling between the aerial platform and the robotic arm in end-to-end aerial manipulation control, which arises because the two differ in action scale, dynamics, and control objectives. The authors propose AIR-VLA+, a flow-matching architecture whose cascaded dual-action decoders explicitly separate flight and manipulation actions. A unidirectional intent-aware coordination mechanism lets the UAV movement decoder observe the manipulator's intent while preventing backpropagated UAV movement information from degrading arm manipulation stability. The UAV movement decoder also combines an asymmetric feature-level mixture-of-experts (MoE) module with an implicit visual grasp projector and compressed global semantic features, giving flight control task-stage adaptability and stronger high-level semantic perception. On the AIR-VLA benchmark, the model achieves an overall average score of 48.0 and improves the overall task completion score by 80.2% over the single-head π₀.₅ policy, mitigating coordination conflicts between the heterogeneous subsystems.
Abstract
Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2\% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.
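To make the data flow in the abstract concrete, the following is a minimal sketch of a cascaded pair of decoders with dense soft expert blending. It is not the authors' implementation: the function names (`arm_decoder`, `uav_decoder`, `soft_moe`, `make_expert`), the toy linear gate, the feature sizes, and the expert definitions are all hypothetical, and flow matching, the visual grasp projector, and the semantic feature injection are omitted. The sketch only illustrates two ideas: the manipulation output is computed first and passed one way into the movement decoder, and every expert contributes to the movement output with a softmax weight rather than through hard top-k routing.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Matrix-vector product; each row of `weights` yields one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def soft_moe(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Sequence[float]], Vector]],
) -> Tuple[Vector, Vector]:
    """Dense soft blending: all experts run, outputs are gate-weighted."""
    gate = softmax(linear(gate_weights, x))
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    blended = [sum(g * out[i] for g, out in zip(gate, outputs)) for i in range(dim)]
    return blended, gate


def arm_decoder(features: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the manipulation decoder."""
    return [0.5 * f for f in features]


def uav_decoder(
    features: Sequence[float],
    arm_intent: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Sequence[float]], Vector]],
) -> Tuple[Vector, Vector]:
    """Hypothetical movement decoder that reads the arm intent one way.

    Assumption: in an autodiff framework, `arm_intent` would be wrapped in a
    stop-gradient (detach) here so the movement loss cannot update the arm
    decoder. Plain Python has no gradients, so only the data flow is shown.
    """
    x = list(features) + list(arm_intent)
    return soft_moe(x, gate_weights, experts)


def make_expert(scale: float) -> Callable[[Sequence[float]], Vector]:
    """Toy expert mapping the input to a 2-D movement command."""
    return lambda x: [scale * sum(x), scale * max(x)]


features = [1.0, 2.0]
arm_intent = arm_decoder(features)            # computed first (cascade)
experts = [make_expert(1.0), make_expert(-1.0)]
gate_weights = [[1.0, 0.0, 0.0, 0.0],         # expert 0 keys on feature 0
                [0.0, 0.0, 0.0, 0.0]]         # expert 1 has a constant logit
uav_action, gate = uav_decoder(features, arm_intent, gate_weights, experts)
print("arm intent:", arm_intent)
print("gate weights:", gate)
print("uav action:", uav_action)
```

Because the gate is a softmax over all experts, the movement output varies smoothly as the input features change, which is one plausible reading of how experts could specialize by task stage without explicit stage labels.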