AI Summary
Diffusion-based policies for robotic manipulation suffer from performance degradation on long-horizon tasks and heightened sensitivity to image noise. To address these limitations, we propose Vision-Language Model-guided Trajectory-Conditioned Diffusion Policies (VLM-TDP). Our method leverages a vision-language model (VLM) to automatically decompose long-horizon tasks into semantically grounded subtasks and to generate voxelized trajectory representations that serve as critical conditional inputs to the diffusion policy, enabling semantic-aware guidance and environment adaptation. Trained solely on demonstration data, VLM-TDP is rigorously evaluated both in simulation and on real-world robotic platforms. Experiments demonstrate a 44% average success-rate improvement in simulation, over 100% gain on long-horizon tasks, a 20% reduction in performance degradation under image noise, and significantly superior real-world performance versus baselines. The core contribution is the first integration of VLM-driven semantic reasoning with trajectory-conditioned diffusion modeling, establishing a new paradigm for robust, long-horizon embodied manipulation.
Abstract
Diffusion policy has demonstrated promising performance in the field of robotic manipulation. However, its effectiveness has been primarily limited to short-horizon tasks, and its performance degrades significantly in the presence of image noise. To address these limitations, we propose a VLM-guided trajectory-conditioned diffusion policy (VLM-TDP) for robust, long-horizon manipulation. Specifically, the proposed method leverages state-of-the-art vision-language models (VLMs) to decompose long-horizon tasks into concise, manageable sub-tasks, while also innovatively generating voxel-based trajectories for each sub-task. The generated trajectories serve as a crucial conditioning factor, effectively steering the diffusion policy and substantially enhancing its performance. The proposed Trajectory-conditioned Diffusion Policy (TDP) is trained on trajectories derived from demonstration data and validated using the trajectories generated by the VLM. Simulation results indicate that our method significantly outperforms classical diffusion policies, achieving an average 44% increase in success rate, over 100% improvement on long-horizon tasks, and a 20% reduction in performance degradation under challenging conditions such as noisy images or altered environments. These findings are further reinforced by our real-world experiments, where the performance gap becomes even more pronounced on long-horizon tasks. Videos are available at https://youtu.be/g0T6h32OSC8
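To make the conditioning idea concrete, the sketch below shows a toy DDPM-style reverse process in which an encoded trajectory is concatenated with the observation embedding and fed to the noise-prediction model at every denoising step. This is only an illustrative sketch of trajectory conditioning in general, not the paper's implementation: all dimensions, the `eps_model` stand-in (a random linear map in place of a trained network), and the variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DDPM noise schedule (illustrative values, not the paper's)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

ACTION_DIM, OBS_DIM, TRAJ_DIM = 7, 16, 12
COND_DIM = OBS_DIM + TRAJ_DIM

# Hypothetical stand-in for a learned noise-prediction network:
# a fixed linear map from [noisy_action | conditioning] to predicted noise.
W = rng.normal(scale=0.01, size=(ACTION_DIM + COND_DIM, ACTION_DIM))

def eps_model(noisy_action, cond):
    """Predict the injected noise given the conditioning vector."""
    x = np.concatenate([noisy_action, cond])
    return x @ W  # a real policy would use a conditioned U-Net / transformer

def sample_action(obs_emb, traj_emb):
    """Reverse diffusion: denoise a random action under trajectory conditioning."""
    cond = np.concatenate([obs_emb, traj_emb])  # trajectory steers the policy
    a = rng.normal(size=ACTION_DIM)             # start from pure noise
    for t in reversed(range(T)):
        eps = eps_model(a, cond)
        # Standard DDPM posterior-mean update for step t -> t-1
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.normal(size=ACTION_DIM)
    return a

obs_emb = rng.normal(size=OBS_DIM)    # e.g. an encoded camera observation
traj_emb = rng.normal(size=TRAJ_DIM)  # e.g. an encoded voxelized sub-task trajectory
action = sample_action(obs_emb, traj_emb)
print(action.shape)  # (7,)
```

In this framing, swapping `traj_emb` for a different sub-task's trajectory changes the conditioning vector and hence the denoised action, which is the mechanism by which VLM-generated trajectories can steer the policy across sub-tasks.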