Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot manipulation policies are hindered by the high cost of collecting real-world demonstration data and poor spatial generalization. To address this, we propose an end-to-end video generation framework that takes only 1–5 real multi-view manipulation videos as input. Our method first performs metric-scale 3D reconstruction and depth-guided point cloud editing to construct an editable 3D scene. A geometry-constrained robot pose correction mechanism ensures physically consistent pose retargeting. We then design a multi-condition video diffusion model—controlled primarily by depth maps and additionally conditioned on action labels, edge maps, and camera ray maps—to synthesize spatially enhanced multi-view manipulation videos. The framework generates high-fidelity training data from minimal real demonstrations, achieving policy learning performance on par with or surpassing baselines trained on 50 real demonstrations. It improves data efficiency by 10–50× and supports flexible editing of 3D scale, texture, and configuration.
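The depth-guided editing step lends itself to a small worked example. The sketch below is not the authors' code: the function names, camera intrinsics, and the toy scene are assumptions. It rigidly moves one object's points within a metric-scale point cloud and re-renders a pinhole depth map, i.e., the kind of physically consistent depth condition the summary describes.

```python
# Hedged sketch (not the paper's implementation): edit a metric-scale point
# cloud by translating one object's points, then re-render a depth map with a
# simple pinhole camera model. All names and values are illustrative.
import numpy as np

def render_depth(points, K, image_size):
    """Z-buffer a point cloud (N, 3) in camera coordinates into a depth map."""
    h, w = image_size
    depth = np.full((h, w), np.inf)
    z = points[:, 2]
    valid = z > 1e-6
    uvw = (K @ points[valid].T).T                 # project with intrinsics K (3x3)
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[valid][inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)    # keep the nearest surface
    depth[np.isinf(depth)] = 0.0
    return depth

def edit_scene(points, object_mask, delta_xyz):
    """Translate the masked object's points; the rest of the scene is untouched."""
    edited = points.copy()
    edited[object_mask] += np.asarray(delta_xyz)
    return edited

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
    scene = rng.uniform([-0.3, -0.2, 0.6], [0.3, 0.2, 1.0], size=(5000, 3))
    obj = np.linalg.norm(scene[:, :2] - np.array([0.1, 0.0]), axis=1) < 0.05  # toy "object"
    edited = edit_scene(scene, obj, delta_xyz=[-0.15, 0.05, 0.0])
    depth_new = render_depth(edited, K, image_size=(240, 320))  # edited depth condition
```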

📝 Abstract
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
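As a rough illustration of the conditioning interface the abstract describes, the following sketch assembles a per-frame condition stack from depth (the primary signal), an edge map, and a camera ray map. The channel layout, the Sobel edge choice, and all names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: build one frame's conditioning stack for a depth-led,
# multi-condition video model. Channel layout is an illustrative assumption.
import numpy as np

def sobel_edges(gray):
    """Sobel gradient magnitude as an edge map (H, W), normalized to [0, 1]."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(gray, 1, mode="edge")
    gx = np.zeros_like(gray, dtype=float)
    gy = np.zeros_like(gray, dtype=float)
    for i in range(3):
        for j in range(3):
            patch = pad[i:i + gray.shape[0], j:j + gray.shape[1]]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

def ray_map(K, image_size):
    """Per-pixel unit ray directions in the camera frame, shape (3, H, W)."""
    h, w = image_size
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    rays = np.linalg.inv(K) @ pix
    rays /= np.linalg.norm(rays, axis=0, keepdims=True)
    return rays.reshape(3, h, w)

def condition_stack(depth, rgb, K):
    """Concatenate depth (primary), edge, and ray-map channels into (5, H, W)."""
    gray = rgb.mean(axis=-1)
    edges = sobel_edges(gray)
    rays = ray_map(K, depth.shape)
    return np.concatenate([depth[None], edges[None], rays], axis=0)
```

In this sketch, action labels are assumed to enter the generator as a separate conditioning stream (e.g., per-frame end-effector poses) rather than as extra image channels; the paper's actual interface may differ.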
Problem

Research questions and friction points this paper is trying to address.

Collecting diverse real-world demonstrations is costly, which limits policy robustness
Visuomotor policies generalize poorly to new spatial configurations of manipulation scenes
Every new scene layout otherwise requires repetitive re-collection of real demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metric-scale 3D reconstruction of scene geometry from multi-view RGB
Depth-reliable 3D editing on point clouds with geometric robot pose correction (see the retargeting sketch after this list)
Multi-conditional video generation guided by depth together with action, edge, and ray maps
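A minimal sketch of the geometry-constrained pose correction idea, assuming the grasp should stay rigid relative to the edited object: the end-effector pose is retargeted by the object's relative transform. The 4x4 poses and helper names below are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch: keep the grasp rigid relative to a relocated object. Given the
# object's old/new poses and the original end-effector pose (4x4 homogeneous
# transforms in the world frame), compute the corrected end-effector pose.
import numpy as np

def retarget_ee_pose(T_obj_old, T_obj_new, T_ee_old):
    """New EE pose preserving the object-relative grasp: T_obj_new @ inv(T_obj_old) @ T_ee_old."""
    return T_obj_new @ np.linalg.inv(T_obj_old) @ T_ee_old

def translation(xyz):
    """Build a pure-translation 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, 3] = xyz
    return T

if __name__ == "__main__":
    T_obj_old = translation([0.40, 0.00, 0.05])
    T_obj_new = translation([0.25, 0.10, 0.05])   # object moved by the 3D edit
    T_ee_old = translation([0.40, 0.00, 0.15])    # gripper 10 cm above the object
    T_ee_new = retarget_ee_pose(T_obj_old, T_obj_new, T_ee_old)
    print(T_ee_new[:3, 3])                        # -> [0.25 0.10 0.15]
```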
Authors

Yujie Zhao
CFCS, School of Computer Science, Peking University

Hongwei Fan
Peking University

Di Chen
AgiBot

Shengcong Chen
Unknown affiliation

Liliang Chen
AgiBot

Xiaoqi Li
CFCS, School of Computer Science, Peking University

Guanghui Ren
AgiBot

Hao Dong
CFCS, School of Computer Science, Peking University