GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models focus primarily on semantic alignment and often fail to capture the geometric awareness and dynamic manipulability that embodied tasks require. This work proposes GeoAlign, an architecture that introduces, for the first time, an ego-state-guided mechanism for querying geometric features. Specifically, the RGB branch is post-trained under RGB-D supervision to produce geometry-enhanced features, which the robot's ego-state then queries dynamically to extract phase-relevant geometric tokens for action prediction. By preserving semantic understanding while achieving spatially precise alignment, GeoAlign improves policy generalization on complex geometric tasks, reaching 99.0% success on LIBERO, an average of 85.3% across three SimplerEnv-Fractal tasks, and 78.8% success on eight real-world ALOHA geometric manipulation tasks.

📝 Abstract

Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.
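The abstract does not say how the proprioceptive state queries the GEP feature grid. As a rough illustration only, the sketch below assumes single-head scaled dot-product cross-attention: each query vector is a linear projection of the state, and each output token is an attention-weighted average of grid features. The names `state_guided_query`, `query_projs`, and `key_proj` are hypothetical, the weights are hand-picked toy values, and none of this should be read as GeoAlign's actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def state_guided_query(state, grid, query_projs, key_proj):
    """Pool a feature grid into a few tokens, steered by the robot state.

    state:       proprioceptive vector, length S
    grid:        N flattened grid cells, each a feature vector of length D
    query_projs: K matrices (each Dk x S); one query per output token
    key_proj:    one matrix (Dk x D) mapping grid features to keys
    Returns (tokens, weights): K tokens of length D and K attention rows.
    All shapes and the attention form are assumptions for illustration.
    """
    keys = [matvec(key_proj, feat) for feat in grid]
    scale = 1.0 / math.sqrt(len(keys[0]))
    feat_dim = len(grid[0])
    tokens, weights = [], []
    for w_q in query_projs:
        query = matvec(w_q, state)  # the query depends only on the state
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        attn = softmax(scores)
        # Each token is an attention-weighted average of the grid features.
        token = [sum(a * feat[d] for a, feat in zip(attn, grid)) for d in range(feat_dim)]
        tokens.append(token)
        weights.append(attn)
    return tokens, weights


# Toy example: 3 grid cells with 2-dim features, 2 output tokens.
toy_grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_key_proj = [[1.0, 0.0], [0.0, 1.0]]
toy_query_projs = [
    [[4.0, 0.0], [0.0, 0.0]],  # reacts to the first state dimension
    [[0.0, 0.0], [0.0, 4.0]],  # reacts to the second state dimension
]
toy_tokens, toy_weights = state_guided_query(
    [1.0, 0.0], toy_grid, toy_query_projs, toy_key_proj
)
```

In this toy setup, changing the state from `[1.0, 0.0]` to `[0.0, 1.0]` shifts the first token's attention from a peaked to a uniform distribution over the grid, which is one simple way to obtain state-dependent (and hence phase-dependent) geometry tokens from a fixed feature grid.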

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

spatial alignment

geometry-aware

affordance selection

robotic manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial alignment

geometry-aware learning

proprioceptive-state-guided querying