RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In robotic imitation learning, acquiring real-world demonstration data is costly, and sim-to-real transfer is hindered by geometric inconsistencies that cause visual distortions in rendered observations. Method: This paper introduces the first diffusion-based framework that jointly incorporates multi-view geometric constraints (depth and surface normals) and cross-view feature interaction to synthesize fine-grained, geometry-consistent robotic manipulation videos from multiple viewpoints. The approach explicitly encodes 3D geometric priors to enhance structural fidelity and visual realism, while supporting background editing and object replacement. Contribution/Results: Evaluated on the DIFF-OBJ and DIFF-ALL cross-domain tasks, policies trained solely on synthetic videos generated by RoboTransfer achieve relative success-rate improvements of 33.3% and 251%, respectively, significantly mitigating the sim-to-real gap.

📝 Abstract
Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
Problem

Research questions and friction points this paper is trying to address.

Bridging sim-to-real gap in robotic imitation learning
Generating geometry-consistent multi-view robotic videos
Enhancing policy training with synthetic visual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based video generation for robotic data
Multi-view geometry with explicit scene control
Cross-view feature interactions for consistency
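The cross-view interaction idea above can be illustrated with a toy sketch: tokens from all camera views attend to one another, and per-token geometry features (stand-ins for the paper's depth/normal conditions) are added to the keys so the attention is geometry-aware. This is a minimal dependency-free illustration of the general mechanism, not the paper's actual architecture; all names (`cross_view_attention`, `view_tokens`, `geom_tokens`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_view_attention(view_tokens, geom_tokens):
    """Toy cross-view attention (illustrative, not the paper's API).

    view_tokens: list of views, each a list of d-dim feature vectors.
    geom_tokens: matching geometry encodings (e.g. depth/normal features),
                 added to the keys so attention weights respect geometry.
    Returns one attended output vector per input token.
    """
    # Flatten tokens from all views into one joint sequence,
    # so every token can attend across view boundaries.
    tokens = [t for view in view_tokens for t in view]
    geoms = [g for view in geom_tokens for g in view]
    d = len(tokens[0])
    # Geometry-conditioned keys: feature + geometry encoding
    keys = [[t[i] + g[i] for i in range(d)] for t, g in zip(tokens, geoms)]
    out = []
    for q in tokens:
        # Scaled dot-product scores against all views' keys
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors (here, the raw tokens)
        out.append([sum(wj * tokens[j][i] for j, wj in enumerate(w))
                    for i in range(d)])
    return out
```

In the actual model this interaction happens inside the diffusion backbone's attention layers, but the sketch shows why it enforces consistency: each view's features are updated from a pool shared across views, weighted by geometric agreement.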