Language-Guided Object-Centric Diffusion Policy for Generalizable and Collision-Aware Robotic Manipulation

📅 2024-06-29

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

To address poor generalization and lack of collision awareness in robotic manipulation under cluttered scenes, viewpoint variations, and interference from visually similar objects, this paper proposes a language-guided object-centric diffusion policy. The method end-to-end integrates large language model (LLM)-based semantic parsing, open-vocabulary 3D instance segmentation, object-centric point cloud representation, and cost-guided denoising—achieving zero-shot, training-free collision avoidance and cross-scene adaptive manipulation for the first time. Key contributions are: (1) a time-invariant cost-guided generative mechanism based on clean trajectory estimation; and (2) a synergistic modeling framework unifying LLM, 3D segmentation, and diffusion policy. Evaluated on 21 RLBench tasks, the approach achieves 68.7% success rate using only 40 demonstrations—outperforming state-of-the-art 2D and 3D baselines by 29% and 25%, respectively—and maintains robust generalization and collision-free performance under unseen conditions including occlusion and viewpoint shifts.

Technology Category

Application Category

📝 Abstract

Learning from demonstrations faces challenges in generalizing beyond the training data and often lacks collision awareness. This paper introduces Lan-o3dp, a language-guided object-centric diffusion policy framework that can adapt to unseen situations such as cluttered scenes, shifting camera views, and ambiguous similar objects while offering training-free collision avoidance and achieving a high success rate with few demonstrations. We train a diffusion model conditioned on 3D point clouds of task-relevant objects to predict the robot's end-effector trajectories, enabling it to complete the tasks. During inference, we incorporate cost optimization into denoising steps to guide the generated trajectory to be collision-free. We leverage open-set segmentation to obtain the 3D point clouds of related objects. We use a large language model to identify the target objects and possible obstacles by interpreting the user's natural language instructions. To effectively guide the conditional diffusion model using a time-independent cost function, we proposed a novel guided generation mechanism based on the estimated clean trajectories. In the simulation, we showed that diffusion policy based on the object-centric 3D representation achieves a much higher success rate (68.7%) compared to baselines with simple 2D (39.3%) and 3D scene (43.6%) representations across 21 challenging RLBench tasks with only 40 demonstrations. In real-world experiments, we extensively evaluated the generalization in various unseen situations and validated the effectiveness of the proposed zero-shot cost-guided collision avoidance.

Problem

Research questions and friction points this paper is trying to address.

Generalizing robotic manipulation beyond training data

Achieving collision-free trajectories in cluttered scenes

Interpreting natural language for object and obstacle identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided diffusion policy for robotic manipulation

Training-free collision avoidance with cost optimization

Open-set segmentation for 3D object point clouds

🔎 Similar Papers

Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

2024-06-14arXiv.orgCitations: 9

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28arXiv.orgCitations: 15

Toyota Research Institute

Los Altos, CA / Cambridge, MA

Research Scientist Intern, Robotic Control Policy (PhD)