VGGT-DP: Generalizable Robot Control via Vision Foundation Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual imitation learning frameworks largely neglect visual encoder design, which limits spatial understanding and generalization. This paper proposes VGGT-DP, a generalizable robot control framework grounded in vision foundation models: it adopts the Visual Geometry Grounded Transformer (VGGT) as the visual encoder, integrating the geometric priors of this pretrained 3D perception model with proprioceptive feedback. The framework introduces a proprioception-guided visual learning strategy together with frame-wise token reuse and random token pruning, enabling efficient, low-latency, and robust policy learning from multi-view inputs. Evaluated on the challenging MetaWorld benchmark, VGGT-DP significantly outperforms strong baselines including DP and DP3, particularly in high-precision manipulation and long-horizon tasks, where it demonstrates superior generalization and stability.

📝 Abstract
Visual imitation learning frameworks allow robots to learn manipulation skills from expert demonstrations. While existing approaches mainly focus on policy design, they often neglect the structure and capacity of visual encoders, limiting spatial understanding and generalization. Inspired by biological vision systems, which rely on both visual and proprioceptive cues for robust control, we propose VGGT-DP, a visuomotor policy framework that integrates geometric priors from a pretrained 3D perception model with proprioceptive feedback. We adopt the Visual Geometry Grounded Transformer (VGGT) as the visual encoder and introduce a proprioception-guided visual learning strategy to align perception with internal robot states, improving spatial grounding and closed-loop control. To reduce inference latency, we design a frame-wise token reuse mechanism that compacts multi-view tokens into an efficient spatial representation. We further apply random token pruning to enhance policy robustness and reduce overfitting. Experiments on challenging MetaWorld tasks show that VGGT-DP significantly outperforms strong baselines such as DP and DP3, particularly in precision-critical and long-horizon scenarios.
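The abstract describes the two efficiency mechanisms only in prose; the sketch below illustrates, under assumed token shapes and a hypothetical magnitude-based scoring heuristic, how frame-wise token reuse (compacting multi-view tokens) and random token pruning could look in code. This is not the authors' implementation, and names such as frame_wise_token_reuse are illustrative.

```python
# Minimal sketch of the two efficiency mechanisms described in the abstract.
# Shapes, function names, and the token-scoring heuristic are assumptions.
import torch

def frame_wise_token_reuse(view_tokens: list, keep: int) -> torch.Tensor:
    """Compact per-view token sets into one spatial representation.

    view_tokens: list of (N_v, D) tensors, one per camera view.
    keep: number of tokens retained per view (here selected by L2 norm,
    an illustrative heuristic, not the paper's criterion).
    """
    compacted = []
    for tokens in view_tokens:
        scores = tokens.norm(dim=-1)                       # score each token
        idx = scores.topk(min(keep, tokens.shape[0])).indices
        compacted.append(tokens[idx])
    # Concatenate the retained tokens from all views into a single set.
    return torch.cat(compacted, dim=0)

def random_token_pruning(tokens: torch.Tensor, drop_ratio: float, training: bool) -> torch.Tensor:
    """Randomly drop a fraction of tokens during training to reduce overfitting."""
    if not training or drop_ratio <= 0.0:
        return tokens
    n_keep = max(1, int(tokens.shape[0] * (1.0 - drop_ratio)))
    idx = torch.randperm(tokens.shape[0])[:n_keep]
    return tokens[idx]

# Example: three camera views, 196 encoder tokens of width 768 each (assumed sizes).
views = [torch.randn(196, 768) for _ in range(3)]
compact = frame_wise_token_reuse(views, keep=64)           # (192, 768)
pruned = random_token_pruning(compact, 0.25, training=True)
print(compact.shape, pruned.shape)
```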
Problem

Research questions and friction points this paper is trying to address.

Improving robot spatial understanding and generalization in visual imitation learning
Integrating geometric priors with proprioceptive feedback for robust control
Reducing inference latency and overfitting in visuomotor policy frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates geometric priors from a pretrained 3D perception model (VGGT)
Uses a proprioception-guided visual learning strategy to align perception with internal robot states
Employs frame-wise token reuse and random token pruning for efficiency and robustness (see the sketch below)
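These contributions compose into a single visuomotor pipeline. The sketch below shows, with hypothetical module names and dimensions, how a VGGT-style token encoder, proprioceptive conditioning, and a policy head could be wired together; the real VGGT-DP architecture (for example its diffusion-based action head) may differ, so this is a structural illustration only.

```python
# Minimal sketch, under assumed interfaces, of a VGGT-conditioned visuomotor policy.
# Module names and dimensions are hypothetical; the policy head is a stand-in MLP
# rather than the diffusion denoiser used by DP-style methods.
import torch
import torch.nn as nn

class VGGTDPPolicySketch(nn.Module):
    def __init__(self, token_dim=768, proprio_dim=9, cond_dim=256, action_dim=7):
        super().__init__()
        # Projection of (compacted, pruned) visual tokens from a frozen VGGT-style encoder.
        self.visual_proj = nn.Linear(token_dim, cond_dim)
        # Proprioceptive state is embedded and fused with the visual condition.
        self.proprio_proj = nn.Linear(proprio_dim, cond_dim)
        # Stand-in for the action head (a diffusion denoiser in DP-style policies).
        self.policy_head = nn.Sequential(
            nn.Linear(2 * cond_dim, 512), nn.ReLU(), nn.Linear(512, action_dim)
        )

    def forward(self, visual_tokens, proprio):
        # visual_tokens: (B, N, token_dim); proprio: (B, proprio_dim)
        vis = self.visual_proj(visual_tokens).mean(dim=1)   # pooled visual condition
        prop = self.proprio_proj(proprio)                    # proprioceptive condition
        cond = torch.cat([vis, prop], dim=-1)
        return self.policy_head(cond)                        # predicted action

policy = VGGTDPPolicySketch()
action = policy(torch.randn(2, 192, 768), torch.randn(2, 9))
print(action.shape)  # torch.Size([2, 7])
```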
👥 Authors
Shijia Ge, Tsinghua University (Machine Learning, AI3DV, Robotics, AI4Med)
Yinxin Zhang, Harbin Institute of Technology, Shenzhen, China
Shuzhao Xie, Tsinghua University (Graphics, Multimedia)
Weixiang Zhang, Tsinghua University (Neural Representation, 3D Computer Vision)
Mingcai Zhou, CASBOT, Beijing, China
Zhi Wang, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China