AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing vision-language-action (VLA) models: because they are trained predominantly on 2D images, they struggle to comprehend 3D spatial structure, which leads to inaccurate action execution in complex three-dimensional environments. To bridge this gap, the authors propose a geometry-aware framework that integrates depth estimation with action priors. Specifically, they employ VGGT, a depth estimation model, to extract 3D features from RGB inputs, and design an action assistant module that leverages action priors to align the 3D representations with control objectives. Fusing these depth-derived 3D features with conventional 2D features improves perception, action-prediction accuracy, generalization, and robustness, particularly in geometrically ambiguous scenarios. The authors present this as the first work to incorporate depth-driven data augmentation and action-prior constraints into the VLA paradigm.
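To make the fusion step concrete, below is a minimal PyTorch sketch of how depth-derived 3D tokens might be combined with 2D visual tokens. The cross-attention fusion, module names, and dimensions are all illustrative assumptions; the summary does not specify the exact fusion operator, and `DepthFeatureFusion` is a hypothetical name, with VGGT standing in as whatever encoder produces the 3D tokens.

```python
# Hypothetical sketch of 2D/3D token fusion (not the paper's exact design).
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Fuses conventional 2D visual tokens with depth-derived 3D tokens
    via cross-attention before they reach the VLA action head."""

    def __init__(self, dim_2d: int = 768, dim_3d: int = 512, dim: int = 768):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim)   # project 2D visual tokens
        self.proj_3d = nn.Linear(dim_3d, dim)   # project 3D geometry tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor) -> torch.Tensor:
        q = self.proj_2d(tokens_2d)             # (B, N2, dim)
        kv = self.proj_3d(tokens_3d)            # (B, N3, dim)
        # 2D tokens attend to geometry tokens; the residual connection
        # keeps the original 2D stream intact.
        attended, _ = self.cross_attn(q, kv, kv)
        return self.norm(q + attended)

# Usage: fused tokens replace plain 2D tokens as policy input.
fusion = DepthFeatureFusion()
fused = fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 512))
print(fused.shape)  # torch.Size([2, 196, 768])
```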

📝 Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches rely primarily on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline, VGGT, to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient use of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module, the action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also yields superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
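The abstract describes the action assistant only as constraining the 3D representations with action priors. One plausible reading, sketched below under that assumption, is an auxiliary head that decodes actions from the 3D features alone, with its loss added to the main VLA objective; the head architecture, mean pooling, L2 prior, and loss weighting are all hypothetical, not the paper's stated formulation.

```python
# Speculative sketch of an action-prior constraint on 3D features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAssistant(nn.Module):
    """Auxiliary head that predicts an action directly from the
    depth-derived 3D tokens, so the action prior shapes those features."""

    def __init__(self, dim_3d: int = 512, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_3d, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, tokens_3d: torch.Tensor) -> torch.Tensor:
        pooled = tokens_3d.mean(dim=1)          # (B, dim_3d)
        return self.head(pooled)                # (B, action_dim)

def assistant_loss(assistant: ActionAssistant,
                   tokens_3d: torch.Tensor,
                   target_action: torch.Tensor,
                   weight: float = 0.1) -> torch.Tensor:
    # Auxiliary term added to the main VLA action loss during training;
    # the L2 form and 0.1 weight are assumptions for illustration.
    return weight * F.mse_loss(assistant(tokens_3d), target_action)
```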
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
3D perception
depth estimation
robotic control
spatial understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth estimation
Vision-Language-Action models
3D feature augmentation
action priors
robotic perception
👥 Authors

Zhifeng Rao
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

Wenlong Chen
Research Scientist, Isomorphic Labs
Machine Learning · Deep Learning · Artificial Intelligence

Lei Xie
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

Xia Hua
Zhejiang University of Technology
Research · Mechanical Engineering

Dongfu Yin
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

Zhen Tian
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China

F. Richard Yu
Carleton University; FRSC, FCAE, MAE, FIEEE, FEIC
Intell. & Auto. Sys. · ML & Embodied AI · IoT · Blockchain