AI Summary
Existing vision-language-action (VLA) models focus primarily on semantic alignment and often fail to capture the geometric awareness and dynamic manipulability that embodied tasks require. This work proposes GeoAlign, an architecture that introduces what the authors describe as the first ego-state-guided geometric feature querying mechanism. Specifically, the RGB branch is post-trained under RGB-D supervision to generate geometry-enhanced features, which are then dynamically queried using the robot's ego-state to extract phase-relevant geometric tokens for action prediction. By preserving semantic understanding while achieving spatially precise alignment, GeoAlign substantially improves policy generalization on complex geometric tasks, attaining 99.0% success on LIBERO, an average of 85.3% across three SimplerEnv-Fractal tasks, and a 78.8% success rate on eight real-world ALOHA geometric manipulation tasks.
Abstract
Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.
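The abstract does not spell out how the proprioceptive state queries the GEP feature grid. One plausible reading is a cross-attention step in which the state is projected into a few query vectors that attend over the flattened grid. The sketch below illustrates that reading only. The function name `state_guided_query`, the projection matrices `W_q`, `W_k`, `W_v`, the single-head softmax attention, and all dimensions are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def state_guided_query(state, feat_grid, W_q, W_k, W_v):
    """Hypothetical state-guided query over a geometry feature grid.

    state:     (d_state,)            proprioceptive state vector
    feat_grid: (H, W, C)             geometry-enhanced feature grid
    W_q:       (T, d_state, d)       one assumed query projection per output token
    W_k, W_v:  (C, d)                assumed key and value projections
    Returns (tokens, attn) with shapes (T, d) and (T, H*W).
    """
    H, W, C = feat_grid.shape
    feats = feat_grid.reshape(H * W, C)             # flatten the spatial grid
    queries = np.einsum("s,tsd->td", state, W_q)    # state -> T query vectors
    keys = feats @ W_k                              # (H*W, d)
    values = feats @ W_v                            # (H*W, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax(scores, axis=-1)                 # each query weights all grid cells
    tokens = attn @ values                          # compact geometry tokens
    return tokens, attn


# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
d_state, H, W, C, d, T = 14, 16, 16, 64, 32, 4
state = rng.normal(size=d_state)
feat_grid = rng.normal(size=(H, W, C))
W_q = rng.normal(size=(T, d_state, d)) / np.sqrt(d_state)
W_k = rng.normal(size=(C, d)) / np.sqrt(C)
W_v = rng.normal(size=(C, d)) / np.sqrt(C)

tokens, attn = state_guided_query(state, feat_grid, W_q, W_k, W_v)
print(tokens.shape, attn.shape)
```

Because the queries are computed from the state, the attention weights change with the robot's configuration. Under this reading, that is what would make the resulting tokens phase-dependent, since a different state selects a different region of the same feature grid.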