Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

📅 2024-06-14
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
📄 PDF
🤖 AI Summary
To address the challenges of scarce robot demonstration data and poor open-vocabulary instruction generalization, this paper introduces the first zero-shot language-guided diffusion policy framework. Methodologically, it leverages a vision-language model (VLM) to parse natural language instructions into 3D keyframes and constructs a diffusion policy conditioned on these keyframes; additionally, it proposes a gradient-driven constrained refinement mechanism that dynamically corrects VLM-predicted keyframe errors during diffusion sampling, enabling joint optimization of keyframe fidelity and trajectory distribution alignment. Experiments demonstrate that our approach significantly outperforms fine-tuned language-conditioned policies in both simulation and real-robot settings, achieving substantial improvements in task success rates and cross-task generalization—particularly on unseen vocabulary tasks.

Technology Category

Application Category

📝 Abstract
Diffusion policies have demonstrated robust performance in generative modeling, prompting their application in robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space. These keyframes serve to guide the diffusion process via inpainting. However, naively enforcing the diffusion process to adhere to the generated keyframes is problematic: the keyframes from the VLMs may be incorrect and lead to action sequences where the diffusion model performs poorly. To address these challenges, we develop an inpainting optimization strategy that balances adherence to the keyframes v.s. the training data distribution. Experimental evaluations demonstrate that our approach surpasses the performance of traditional fine-tuned language-conditioned methods in both simulated and real-world settings.
Problem

Research questions and friction points this paper is trying to address.

Generalizing language-conditioned diffusion policies to open-vocabulary instructions
Addressing scarcity and cost of robot demonstration datasets
Balancing keyframe adherence with motion priors when VLM outputs are inaccurate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for translation
Uses constrained inpainting with diffusion policies
Balances keyframe adherence with motion priors
🔎 Similar Papers
No similar papers found.