GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

📅 2026-06-02

🤖 AI Summary
Existing vision-language-action (VLA) models primarily focus on semantic alignment, often falling short in capturing the geometric awareness and dynamic manipulability required for embodied tasks. This work proposes GeoAlign, a novel architecture that introduces, for the first time, an ego-state-guided geometric feature querying mechanism. Specifically, the RGB branch is post-trained under RGB-D supervision to generate geometry-enhanced features, which are then dynamically queried using the robot's ego-state to extract phase-relevant geometric tokens for action prediction. By preserving semantic understanding while achieving spatially precise alignment, GeoAlign substantially improves policy generalization on complex geometric tasks, attaining 99.0% success on LIBERO, an average of 85.3% across three SimplerEnv-Fractal tasks, and 78.8% success rate on eight real-world ALOHA geometric manipulation tasks.
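To make the RGB-D supervision step concrete, here is a minimal sketch of one plausible training signal. It assumes the RGB branch has a depth prediction head and is trained with a masked L1 regression loss against sensor depth. The summary does not state the actual objective, so the loss form, the function name `masked_l1_depth_loss`, and the `min_depth` threshold are all illustrative assumptions.

```python
import numpy as np

def masked_l1_depth_loss(pred_depth, gt_depth, min_depth=1e-3):
    """Mean absolute depth error over pixels that have a valid sensor reading.

    Hypothetical stand-in for the RGB-D supervision signal; the paper's
    actual objective is not given in this summary.
    """
    pred_depth = np.asarray(pred_depth, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    # Depth cameras commonly report 0 where no return was measured,
    # so those pixels are excluded from the loss.
    valid = gt_depth > min_depth
    if not valid.any():
        return 0.0
    return float(np.abs(pred_depth[valid] - gt_depth[valid]).mean())

# Toy 2x2 example: one pixel has no ground-truth depth and is ignored.
pred = [[1.0, 2.0], [3.0, 4.0]]
gt = [[1.5, 0.0], [3.0, 5.0]]
loss = masked_l1_depth_loss(pred, gt)
```

Only the RGB image is needed at rollout time under this assumption; the depth map is used purely as a training target.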
📝 Abstract
Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.
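The state-guided querying described above can be pictured as cross-attention in which the queries come from the robot state and the keys and values come from the feature grid. The sketch below is one way this might look, not the authors' implementation: the single-head attention form, the projection matrices `w_q`, `w_k`, `w_v`, the token count, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def state_guided_query(features, state, w_q, w_k, w_v):
    """Pool a feature grid into a few tokens using state-derived queries.

    features: (H, W, C) geometry-enhanced feature grid
    state:    (S,) proprioceptive state vector
    w_q:      (K, S, D) one hypothetical query projection per output token
    w_k, w_v: (C, D) hypothetical key and value projections
    Returns tokens (K, D) and attention weights (K, H*W).
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)              # (N, C), N = H*W
    queries = np.einsum("ksd,s->kd", w_q, state)   # (K, D), depends only on state
    keys = flat @ w_k                              # (N, D)
    values = flat @ w_v                            # (N, D)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax(scores, axis=-1)                # each row sums to 1 over grid cells
    tokens = attn @ values                         # (K, D) compact geometry tokens
    return tokens, attn

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 16))   # toy 8x8 grid with 16 channels
state = rng.normal(size=7)               # e.g. a 7-DoF joint state (assumed)
w_q = rng.normal(size=(4, 7, 32))        # 4 output tokens of width 32
w_k = rng.normal(size=(16, 32))
w_v = rng.normal(size=(16, 32))
tokens, attn = state_guided_query(features, state, w_q, w_k, w_v)
```

Because the queries are a function of the state alone, the same image yields different tokens as the arm moves through a task, which is one reading of "phase-dependent" in the abstract.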
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial alignment
geometry-aware
affordance selection
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial alignment
geometry-aware learning
proprioceptive-state-guided querying
post-training
Vision-Language-Action (VLA)
Yizhi Chen, Tongji University
Zhanxiang Cao, Shanghai Jiao Tong University (Robotics, Reinforcement Learning, Legged Robot)
Xinyi Peng, Tongji University
Yixiao Zheng, HONOR
Xiaxi Si, Shanghai Jiao Tong University
Yiheng Li, Shanghai Jiao Tong University
Liyun Yan, Shanghai Jiao Tong University
Keqi Zhu, Zhejiang University
Xueyun Chen, Jingdezhen Ceramic University
Shengcheng Fu, Tongji University
Tianyue Zhan, Shanghai Jiao Tong University
Yufei Jia, Tsinghua University
Jinming Yao, University of Science and Technology of China
Yan Xie, HONOR
Kun Wang, HONOR
Cewu Lu, Shanghai Jiao Tong University
Yue Gao, Shanghai Jiao Tong University