🤖 AI Summary
Vision-language-action (VLA) models, constrained by 2D image inputs, struggle to encode 3D geometric structure, which leads to poor generalization to novel camera viewpoints. To address this, we propose a geometric prior enhancement framework: we freeze a pre-trained geometric vision model (e.g., a depth or surface-normal estimator) as a fixed feature extractor and introduce a lightweight, learnable projection layer that injects geometry-rich features into the VLA policy decoder, requiring neither 3D annotations nor end-to-end fine-tuning. The approach is agnostic to action space type (continuous or discrete). Evaluated on the LIBERO benchmark, it achieves over a twofold improvement in zero-shot transfer success rates. On real-robot manipulation tasks it significantly outperforms existing methods, demonstrating stronger 3D consistency and robust manipulation, especially under unseen camera viewpoints.
📝 Abstract
Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, boosting success rates by over 2x in simulation. Crucially, these benefits translate to the physical world; our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key component for creating more generalizable robotic agents.
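The core recipe described above (freeze a pretrained geometric encoder, train only a small projection layer feeding the policy decoder) can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class and module names (`GeoAwarePolicy`, the toy stand-in encoder, the decoder head) are hypothetical, and the real system would use an actual depth/normal estimator and a full VLA decoder.

```python
# Minimal sketch (assumed names, not the paper's code) of the GeoAware-VLA idea:
# a frozen geometric vision model supplies features, and only a lightweight
# projection layer (plus the policy head) receives gradient updates.
import torch
import torch.nn as nn


class GeoAwarePolicy(nn.Module):
    def __init__(self, geo_encoder: nn.Module, feat_dim: int,
                 policy_dim: int, action_dim: int):
        super().__init__()
        self.geo_encoder = geo_encoder
        # Freeze the pretrained geometric model: no gradients, eval mode.
        for p in self.geo_encoder.parameters():
            p.requires_grad = False
        self.geo_encoder.eval()
        # Lightweight trainable projection adapting geometry-rich features.
        self.projection = nn.Linear(feat_dim, policy_dim)
        # Stand-in policy decoder; the real head could output continuous
        # actions or discrete action tokens.
        self.decoder = nn.Sequential(nn.ReLU(), nn.Linear(policy_dim, action_dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # encoder stays fixed throughout training
            geo_feats = self.geo_encoder(images)
        return self.decoder(self.projection(geo_feats))


# Toy usage: a linear layer stands in for the frozen depth/normal estimator,
# and images are flattened 3x8x8 tensors for brevity.
frozen_encoder = nn.Linear(3 * 8 * 8, 64)
policy = GeoAwarePolicy(frozen_encoder, feat_dim=64, policy_dim=32, action_dim=7)
# Only the trainable parameters (projection + decoder) go to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in policy.parameters() if p.requires_grad), lr=1e-4
)
actions = policy(torch.randn(4, 3 * 8 * 8))  # batch of 4 flattened images
```

Because the geometric backbone never changes, the decoder is relieved of learning 3D consistency from scratch, and training cost stays close to that of tuning a small adapter.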