🤖 AI Summary
Existing vision pretraining predominantly relies on 2D images, neglecting the intrinsic 3D structure of the physical world, and is hindered by the scarcity of large-scale annotated 3D data. To address this, we propose FVP, the first 4D (3D spatial + temporal) visual pretraining framework tailored for real-world robotic learning. FVP formulates pretraining as a spatiotemporal point cloud prediction task and employs diffusion models for self-supervised learning on large-scale RGB-D video sequences, unifying and enhancing diverse 3D representation capabilities. Experiments demonstrate that FVP boosts the average success rate of 3D Diffusion Policy by 28% across 12 real-world manipulation tasks and achieves state-of-the-art performance in imitation learning. Moreover, FVP exhibits strong generalization across different encoders and datasets.
📄 Abstract
General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. Moreover, due to the scarcity of large-scale 3D data, it remains hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that can improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the predictor as a diffusion model, and pre-trains it directly on large public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance among imitation learning methods. Moreover, FVP remains effective across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/.
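To make the pre-training objective concrete, here is a minimal, hedged sketch of a diffusion-based next-point-cloud-prediction loss: a denoiser conditioned on the current frame's point cloud learns to predict the noise added to the next frame. All class names, the network architecture, and the noise schedule below are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextPointCloudDenoiser(nn.Module):
    """Toy denoiser: predicts the noise on the next frame's points,
    conditioned on a pooled embedding of the current frame.
    (Illustrative stand-in for FVP's diffusion prediction model.)"""
    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.cond_encoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.denoiser = nn.Sequential(
            nn.Linear(dim + hidden + 1, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, noisy_next, cond_points, t):
        # cond_points: (B, N, 3) current frame; max-pool to a global feature
        cond = self.cond_encoder(cond_points).max(dim=1).values      # (B, hidden)
        B, N, _ = noisy_next.shape
        cond = cond[:, None, :].expand(B, N, cond.shape[-1])
        t = t.view(B, 1, 1).expand(B, N, 1).float()                  # timestep as feature
        return self.denoiser(torch.cat([noisy_next, cond, t], dim=-1))

def next_pc_diffusion_loss(model, cur_pc, next_pc, num_steps=1000):
    """Standard DDPM-style noise-prediction loss on the next frame,
    with a simple cosine noise schedule (assumed, not from the paper)."""
    B = cur_pc.shape[0]
    t = torch.randint(0, num_steps, (B,))
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2
    noise = torch.randn_like(next_pc)
    a = alpha_bar.sqrt().view(B, 1, 1)
    s = (1 - alpha_bar).sqrt().view(B, 1, 1)
    noisy_next = a * next_pc + s * noise                             # forward diffusion
    pred_noise = model(noisy_next, cur_pc, t)
    return F.mse_loss(pred_noise, noise)

# Self-supervised signal from an RGB-D video: consecutive point cloud frames
model = NextPointCloudDenoiser()
cur_frame = torch.randn(2, 128, 3)   # (batch, points, xyz)
next_frame = torch.randn(2, 128, 3)
loss = next_pc_diffusion_loss(model, cur_frame, next_frame)
```

In this sketch the conditioning encoder plays the role of the 3D representation being pre-trained; after pre-training, its weights would initialize the point cloud encoder of a downstream policy such as DP3.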