Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

📅 2026-06-01

🤖 AI Summary
Existing end-to-end robotic manipulation approaches rely on 2D visual inputs, which struggle to capture the inherently 3D nature of tasks and suffer from misalignment between perception and action spaces in both spatial and temporal dimensions, limiting generalization. This work proposes a pixel-wise 3D visual representation that constructs aligned vertex maps using camera calibration and depth information, unifying multi-view perception and robot actions within a shared world coordinate frame. To achieve viewpoint-invariant encoding, a bird’s-eye-view (BEV) representation is introduced, complemented by a cross-platform trajectory time-alignment mechanism. The proposed approach substantially mitigates spatiotemporal misalignment between perception and action, significantly enhancing policy generalization and robustness across diverse robots, viewpoints, and human demonstrators. The authors also release pretrained models, code, and a complete data processing pipeline to support reproducibility and further research.
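The vertex-map construction mentioned in the summary can be illustrated with a minimal sketch. It assumes a pinhole camera, metric depth measured along the camera z-axis, and a 4x4 camera-to-world extrinsic matrix; the function name `aligned_vertex_map` and its arguments are hypothetical and are not taken from the released code. It does not cover the case where depth is unavailable.

```python
import numpy as np


def aligned_vertex_map(depth, K, T_world_cam):
    """Back-project a depth image into per-pixel 3D points in the world frame.

    Hypothetical helper for illustration only (not the authors' implementation).
    depth:        (H, W) metric depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    T_world_cam:  (4, 4) camera-to-world rigid transform
    returns:      (H, W, 3) array; entry [v, u] is the world point seen at pixel (u, v)
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both grids have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pinhole back-projection into the camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1)

    # Rigid transform into the shared world frame: p_w = R p_c + t.
    R, t = T_world_cam[:3, :3], T_world_cam[:3, 3]
    return pts_cam @ R.T + t


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
    depth = np.full((3, 5), 2.0)
    T = np.eye(4)
    T[:3, 3] = [1.0, 2.0, 3.0]  # camera origin placed at (1, 2, 3) in the world
    vmap = aligned_vertex_map(depth, K, T)
    print(vmap.shape, vmap[1, 2])
```

Because every camera's vertex map lands in the same world frame, a robot action expressed in that frame (for example, an end-effector position) can be compared directly with the per-pixel points from any view.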
📝 Abstract
End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) reliance on 2D RGB inputs, which ignores the intrinsically 3D nature of manipulation; and 2) a lack of 3D spatial alignment between input and output spaces, as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce the aligned vertex map and vertex spectrum, a pixel-wise 3D representation that lifts 2D visual inputs to 3D using camera calibration and optional depth. This input representation combines 3D awareness with the generalization of large 2D VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Building on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and construct BEV images, producing a view-invariant representation that is robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline that performs these alignments; we also introduce a temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. Together, these contributions mitigate spatial and temporal misalignment between inputs and outputs, improving consistency and generalization in real-world manipulation. A pretrained checkpoint, source code, and the data processing pipeline are available at https://hnuzhy.github.io/projects/Dex-BEV.
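One way to picture the BEV image construction is as a top-down rasterization of world-frame points. The sketch below is an assumption-laden illustration, not the paper's procedure: it assumes the world z-axis points up, uses a fixed rectangular workspace and cell size, and keeps the colour of the highest point in each cell. The name `rasterize_bev` and all parameters are hypothetical.

```python
import numpy as np


def rasterize_bev(points, colors, x_range, y_range, cell):
    """Project world-frame points onto a top-down grid (illustrative sketch only).

    points:  (..., 3) world coordinates, z assumed to point up
    colors:  (..., C) per-point colours, same leading shape as points
    x_range, y_range: (min, max) workspace bounds in world units
    cell:    grid cell size in world units
    returns: (bev, height) with shapes (ny, nx, C) and (ny, nx)
    """
    pts = points.reshape(-1, 3)
    cols = colors.reshape(-1, colors.shape[-1])

    # Drop invalid points (e.g. missing depth) before computing grid indices.
    finite = np.isfinite(pts).all(axis=1)
    pts, cols = pts[finite], cols[finite]

    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((pts[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((pts[:, 1] - y_range[0]) / cell).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    pts, cols, ix, iy = pts[inside], cols[inside], ix[inside], iy[inside]

    # Write points in ascending height so the highest point in a cell is written last.
    order = np.argsort(pts[:, 2], kind="stable")
    bev = np.zeros((ny, nx, cols.shape[-1]), dtype=colors.dtype)
    height = np.full((ny, nx), np.nan)
    bev[iy[order], ix[order]] = cols[order]
    height[iy[order], ix[order]] = pts[order, 2]
    return bev, height


if __name__ == "__main__":
    pts = np.array([[0.1, 0.1, 0.5], [0.2, 0.2, 0.1], [0.7, 0.1, 0.0], [2.0, 2.0, 0.0]])
    cols = np.array([[200] * 3, [10] * 3, [50] * 3, [99] * 3], dtype=np.uint8)
    bev, height = rasterize_bev(pts, cols, (0.0, 1.0), (0.0, 1.0), 0.5)
    print(bev[:, :, 0])
    print(height)
```

Since the grid is defined in the world frame rather than in any camera's image plane, the same scene yields the same BEV layout regardless of where the cameras are mounted, which is the intuition behind a view-invariant representation.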
Problem

Research questions and friction points this paper is trying to address.

3D alignment
robotic manipulation
generalizable policies
spatial misalignment
BEV representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bird's-Eye-View (BEV)
3D alignment
vertex map
view-invariant representation
temporal trajectory alignment
Huayi Zhou
DexForce Technology
Wei Gao
DexForce Technology
Dekun Lu
DexForce Technology
Ruiji Liu
DexForce Technology
Zhanqi Zhang
DexForce Technology
Ziyang Zhang
DexForce Technology
Jian Chen
DexForce Technology
Wenlve Zhou
South China University of Technology
Artificial Intelligence, Computer Vision
Sheng Xu
PhD Student, The Chinese University of Hong Kong, Shenzhen
Reinforcement Learning, Machine Learning
Shumin Li
DexForce Technology
Kangyi Guo
DexForce Technology
Shichen Xu
DexForce Technology
Zixin Huang
DexForce Technology
Yongyi Su
South China University of Technology
Computer Vision, Machine Learning, Test-Time Adaptation
Kui Jia
DexForce Technology; The Chinese University of Hong Kong, Shenzhen