AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing vision-language-action (VLA) models: because they are trained predominantly on 2D images, they struggle to comprehend 3D spatial structure, which leads to inaccurate action execution in complex three-dimensional environments. To bridge this gap, the authors propose a geometry-aware framework that integrates depth estimation with action priors. Specifically, they employ VGGT, a depth estimation model, to extract 3D features from RGB inputs, and design an action assistant module that leverages action priors to align the 3D representations with control objectives. Fusing these depth-derived 3D features with conventional 2D features improves perception, action-prediction accuracy, generalization, and robustness, particularly in geometrically ambiguous scenarios. The authors present this as the first work to incorporate depth-driven data augmentation and action-prior constraints into the VLA paradigm.
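To make the fusion step concrete, below is a minimal PyTorch sketch of how depth-derived 3D tokens might be combined with 2D visual tokens. The cross-attention fusion, module names, and dimensions are all illustrative assumptions; the summary does not specify the exact fusion operator, and `DepthFeatureFusion` is a hypothetical name, with VGGT standing in as whatever encoder produces the 3D tokens.

```python
# Hypothetical sketch of 2D/3D token fusion (not the paper's exact design).
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Fuses conventional 2D visual tokens with depth-derived 3D tokens
    via cross-attention before they reach the VLA action head."""

    def __init__(self, dim_2d: int = 768, dim_3d: int = 512, dim: int = 768):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim)   # project 2D visual tokens
        self.proj_3d = nn.Linear(dim_3d, dim)   # project 3D geometry tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor) -> torch.Tensor:
        q = self.proj_2d(tokens_2d)             # (B, N2, dim)
        kv = self.proj_3d(tokens_3d)            # (B, N3, dim)
        # 2D tokens attend to geometry tokens; the residual connection
        # keeps the original 2D stream intact.
        attended, _ = self.cross_attn(q, kv, kv)
        return self.norm(q + attended)

# Usage: fused tokens replace plain 2D tokens as policy input.
fusion = DepthFeatureFusion()
fused = fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 512))
print(fused.shape)  # torch.Size([2, 196, 768])
```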

📝 Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches rely primarily on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline, VGGT, to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient use of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module, the action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also yields superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
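The abstract describes the action assistant only as constraining the 3D representations with action priors. One plausible reading, sketched below under that assumption, is an auxiliary head that decodes actions from the 3D features alone, with its loss added to the main VLA objective; the head architecture, mean pooling, L2 prior, and loss weighting are all hypothetical, not the paper's stated formulation.

```python
# Speculative sketch of an action-prior constraint on 3D features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAssistant(nn.Module):
    """Auxiliary head that predicts an action directly from the
    depth-derived 3D tokens, so the action prior shapes those features."""

    def __init__(self, dim_3d: int = 512, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_3d, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, tokens_3d: torch.Tensor) -> torch.Tensor:
        pooled = tokens_3d.mean(dim=1)          # (B, dim_3d)
        return self.head(pooled)                # (B, action_dim)

def assistant_loss(assistant: ActionAssistant,
                   tokens_3d: torch.Tensor,
                   target_action: torch.Tensor,
                   weight: float = 0.1) -> torch.Tensor:
    # Auxiliary term added to the main VLA action loss during training;
    # the L2 form and 0.1 weight are assumptions for illustration.
    return weight * F.mse_loss(assistant(tokens_3d), target_action)
```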
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
3D perception
depth estimation
robotic control
spatial understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth estimation
Vision-Language-Action models
3D feature augmentation
action priors
robotic perception
👥 Authors

Zhifeng Rao
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

Wenlong Chen
Research Scientist, Isomorphic Labs
Machine Learning · Deep Learning · Artificial Intelligence

Lei Xie
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

Xia Hua
Zhejiang University of Technology
Research · Mechanical Engineering

Dongfu Yin
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

Zhen Tian
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

F. Richard Yu
Carleton University; FRSC, FCAE, MAE, FIEEE, FEIC
Intell. & Auto. Sys. · ML & Embodied AI · IoT · Blockchain