🤖 AI Summary
Existing vision-language-action (VLA) models generalize poorly to unseen objects, background shifts, and different robot embodiments. This work proposes GEAR-VLA, a framework that learns a unified geometry-aware action representation and shares it across robots through embodiment canonicalization, which confines robot differences to the low-level interface. The approach combines multi-source embodied pretraining, semantic-aligned 3D spatial features, coarse-to-fine action learning that connects discrete action understanding to continuous control through latent action tokens, and a gradient-decoupled DiT continuous action expert. GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches success rates of 85.9% on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and attains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects.
📝 Abstract
Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.
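The abstract does not specify how embodiment canonicalization is implemented. As a rough illustration of the general idea only, the sketch below pads robot-specific action vectors into one shared layout and returns a validity mask, so that a policy could emit a fixed-size action while the robot-specific shape is restored only at the low-level interface. The layout (two arm slots of 7 values each), the names `Embodiment`, `canonicalize`, and `decanonicalize`, and the example robot are all hypothetical and are not taken from GEAR-VLA.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical canonical layout (an assumption, not from the paper):
# two arm slots, each holding 6 end-effector delta-pose values + 1 gripper value.
ARM_SLOTS = 2
DIMS_PER_ARM = 7
CANONICAL_DIM = ARM_SLOTS * DIMS_PER_ARM


@dataclass(frozen=True)
class Embodiment:
    """Minimal robot description used only by this sketch."""
    name: str
    num_arms: int


def _native_dim(emb: Embodiment) -> int:
    """Size of the robot-specific action vector, validated against the layout."""
    if not 1 <= emb.num_arms <= ARM_SLOTS:
        raise ValueError(f"{emb.name}: num_arms must be in 1..{ARM_SLOTS}")
    return emb.num_arms * DIMS_PER_ARM


def canonicalize(action: List[float], emb: Embodiment) -> Tuple[List[float], List[int]]:
    """Pad a robot-specific action to the shared layout; mask marks real entries."""
    native = _native_dim(emb)
    if len(action) != native:
        raise ValueError(f"{emb.name}: expected {native} values, got {len(action)}")
    pad = CANONICAL_DIM - native
    return list(action) + [0.0] * pad, [1] * native + [0] * pad


def decanonicalize(canonical: List[float], emb: Embodiment) -> List[float]:
    """Drop padded slots so only the low-level interface sees the robot's shape."""
    if len(canonical) != CANONICAL_DIM:
        raise ValueError(f"expected {CANONICAL_DIM} values, got {len(canonical)}")
    return list(canonical[:_native_dim(emb)])


if __name__ == "__main__":
    single = Embodiment(name="single_arm_robot", num_arms=1)  # hypothetical robot
    native_action = [0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0]
    shared, mask = canonicalize(native_action, single)
    print(len(shared), sum(mask))
    print(decanonicalize(shared, single) == native_action)
```

In a setup like this, the mask could be used to exclude padded dimensions from the training loss, which is one common way to train a single action head on data from robots with different degrees of freedom.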