Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

📅 2024-06-14

🏛️ arXiv.org

📈 Citations: 9

✨ Influential: 1

career value

185K/year

🤖 AI Summary

To address the challenges of scarce robot demonstration data and poor open-vocabulary instruction generalization, this paper introduces the first zero-shot language-guided diffusion policy framework. Methodologically, it leverages a vision-language model (VLM) to parse natural language instructions into 3D keyframes and constructs a diffusion policy conditioned on these keyframes; additionally, it proposes a gradient-driven constrained refinement mechanism that dynamically corrects VLM-predicted keyframe errors during diffusion sampling, enabling joint optimization of keyframe fidelity and trajectory distribution alignment. Experiments demonstrate that our approach significantly outperforms fine-tuned language-conditioned policies in both simulation and real-robot settings, achieving substantial improvements in task success rates and cross-task generalization—particularly on unseen vocabulary tasks.

Technology Category

Application Category

📝 Abstract

Diffusion policies have demonstrated robust performance in generative modeling, prompting their application in robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space. These keyframes serve to guide the diffusion process via inpainting. However, naively enforcing the diffusion process to adhere to the generated keyframes is problematic: the keyframes from the VLMs may be incorrect and lead to action sequences where the diffusion model performs poorly. To address these challenges, we develop an inpainting optimization strategy that balances adherence to the keyframes v.s. the training data distribution. Experimental evaluations demonstrate that our approach surpasses the performance of traditional fine-tuned language-conditioned methods in both simulated and real-world settings.

Problem

Research questions and friction points this paper is trying to address.

Generalizing language-conditioned diffusion policies to open-vocabulary instructions

Addressing scarcity and cost of robot demonstration datasets

Balancing keyframe adherence with motion priors when VLM outputs are inaccurate

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for translation

Uses constrained inpainting with diffusion policies

Balances keyframe adherence with motion priors

🔎 Similar Papers

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

2024-04-28arXiv.orgCitations: 12

Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting

2024-04-29arXiv.orgCitations: 1