🤖 AI Summary
Existing 3D visuomotor policies rely heavily on large-scale real-world demonstrations, and conventional demonstration generation based on object-centric transformations struggles to produce diverse combinations of scenes, skills, and objects, which limits generalization in complex, long-horizon tasks. This work proposes Task-Edit, a demonstration generation framework that takes a task-centric perspective: it decomposes a task into three components (scene, skill, and object) and flexibly recombines them to synthesize diverse demonstration trajectories. By moving beyond object-centric transformations, Task-Edit generates data for policies that take depth maps and point clouds as input, across multiple robot embodiments. Real-world experiments show that Task-Edit significantly improves policy performance and generalization across scenario setups, including hard-to-collect settings that involve disturbances, obstacle avoidance, and unseen cluttered scenes.
📝 Abstract
3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To reduce this cost, existing methods commonly use demonstration generation strategies that improve data efficiency by applying object-centric transformations, such as varying object poses or scales, to human-collected demonstrations. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill, and object components and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance, and unseen cluttered scenes.
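The decompose-and-recombine idea can be illustrated with a minimal Python sketch. It is not the Task-Edit implementation: the abstract does not describe how trajectories or 3D observations are actually edited, so the sketch only enumerates symbolic task specifications. `TaskSpec`, `recombine`, and all scene, skill, and object names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical symbolic description of one task variant."""
    scene: str
    skills: Tuple[str, ...]  # ordered skill sequence
    obj: str


def recombine(
    scenes: Iterable[str],
    skill_sequences: Iterable[Sequence[str]],
    objects: Iterable[str],
) -> List[TaskSpec]:
    """Enumerate every scene / skill-sequence / object combination."""
    return [
        TaskSpec(scene, tuple(skills), obj)
        for scene, skills, obj in product(scenes, skill_sequences, objects)
    ]


# Illustrative components (assumed, not taken from the paper).
scenes = ["empty_table", "cluttered_table"]
skill_sequences = [("pick", "place"), ("pick", "avoid_obstacle", "place")]
objects = ["cup", "box", "bottle"]

specs = recombine(scenes, skill_sequences, objects)
print(len(specs))  # 2 scenes * 2 skill sequences * 3 objects = 12
```

Under these assumptions, a handful of components expands multiplicatively into many task variants. An object-centric transformation, by contrast, would vary only the `obj` field (for example its pose or scale) while keeping the scene and skill sequence fixed.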