🤖 AI Summary
Vision-language-action (VLA) models, constrained by 2D image inputs, struggle to encode 3D geometric structure, which leads to poor generalization to novel camera viewpoints. To address this, we propose a geometric prior enhancement framework: we freeze a pre-trained geometric vision model (e.g., a depth or surface-normal estimator) as a fixed feature extractor and introduce a lightweight, learnable projection layer that injects geometry-rich features into the VLA policy decoder, requiring neither 3D annotations nor end-to-end fine-tuning. The approach is agnostic to action space type (continuous or discrete). Evaluated on the LIBERO benchmark, it achieves over a twofold improvement in zero-shot transfer success rates. On real-robot manipulation tasks it significantly outperforms existing methods, demonstrating stronger 3D consistency and robust manipulation, especially under unseen camera viewpoints.
📝 Abstract
Vision-Language-Action (VLA) models often fail to generalize to novel camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial improvements in zero-shot generalization to novel camera poses, boosting success rates by over 2x in simulation. Crucially, these benefits translate to the physical world; our model shows a significant performance gain on a real robot, especially when evaluated from unseen camera angles. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key component for creating more generalizable robotic agents.
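The core recipe described above (freeze a pretrained geometric encoder, train only a small projection layer feeding the policy decoder) can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class and module names (`GeoAwarePolicy`, the toy stand-in encoder, the decoder head) are hypothetical, and the real system would use an actual depth/normal estimator and a full VLA decoder.

```python
# Minimal sketch (assumed names, not the paper's code) of the GeoAware-VLA idea:
# a frozen geometric vision model supplies features, and only a lightweight
# projection layer (plus the policy head) receives gradient updates.
import torch
import torch.nn as nn


class GeoAwarePolicy(nn.Module):
    def __init__(self, geo_encoder: nn.Module, feat_dim: int,
                 policy_dim: int, action_dim: int):
        super().__init__()
        self.geo_encoder = geo_encoder
        # Freeze the pretrained geometric model: no gradients, eval mode.
        for p in self.geo_encoder.parameters():
            p.requires_grad = False
        self.geo_encoder.eval()
        # Lightweight trainable projection adapting geometry-rich features.
        self.projection = nn.Linear(feat_dim, policy_dim)
        # Stand-in policy decoder; the real head could output continuous
        # actions or discrete action tokens.
        self.decoder = nn.Sequential(nn.ReLU(), nn.Linear(policy_dim, action_dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # encoder stays fixed throughout training
            geo_feats = self.geo_encoder(images)
        return self.decoder(self.projection(geo_feats))


# Toy usage: a linear layer stands in for the frozen depth/normal estimator,
# and images are flattened 3x8x8 tensors for brevity.
frozen_encoder = nn.Linear(3 * 8 * 8, 64)
policy = GeoAwarePolicy(frozen_encoder, feat_dim=64, policy_dim=32, action_dim=7)
# Only the trainable parameters (projection + decoder) go to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in policy.parameters() if p.requires_grad), lr=1e-4
)
actions = policy(torch.randn(4, 3 * 8 * 8))  # batch of 4 flattened images
```

Because the geometric backbone never changes, the decoder is relieved of learning 3D consistency from scratch, and training cost stays close to that of tuning a small adapter.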