VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation

📅 2025-07-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Diffusion-based policies for robotic manipulation suffer from performance degradation on long-horizon tasks and heightened sensitivity to image noise. To address these limitations, we propose Vision-Language Model-guided Trajectory-Conditioned Diffusion Policies (VLM-TDP). Our method leverages a vision-language model (VLM) to automatically decompose long-horizon tasks into semantically grounded subtasks and generate voxelized trajectory representations as critical conditional inputs to the diffusion policy, enabling semantic-aware guidance and environment adaptation. Trained solely on demonstration data, VLM-TDP is rigorously evaluated on both simulation and real-world robotic platforms. Experiments demonstrate a 44% average success rate improvement in simulation, over 100% gain on long-horizon tasks, a 20% reduction in performance degradation under image noise, and significantly superior real-world performance versus baselines. The core contribution lies in the first integration of VLM-driven semantic reasoning with trajectory-conditioned diffusion modeling, establishing a new paradigm for robust, long-horizon embodied manipulation.

๐Ÿ“ Abstract
Diffusion policy has demonstrated promising performance in the field of robotic manipulation. However, its effectiveness has been primarily limited to short-horizon tasks, and its performance significantly degrades in the presence of image noise. To address these limitations, we propose a VLM-guided trajectory-conditioned diffusion policy (VLM-TDP) for robust and long-horizon manipulation. Specifically, the proposed method leverages state-of-the-art vision-language models (VLMs) to decompose long-horizon tasks into concise, manageable sub-tasks, while also innovatively generating voxel-based trajectories for each sub-task. The generated trajectories serve as a crucial conditioning factor, effectively steering the diffusion policy and substantially enhancing its performance. The proposed Trajectory-conditioned Diffusion Policy (TDP) is trained on trajectories derived from demonstration data and validated using the trajectories generated by the VLM. Simulation results indicate that our method significantly outperforms classical diffusion policies, achieving an average 44% increase in success rate, over 100% improvement in long-horizon tasks, and a 20% reduction in performance degradation under challenging conditions, such as noisy images or altered environments. These findings are further reinforced by our real-world experiments, where the performance gap becomes even more pronounced in long-horizon tasks. Videos are available at https://youtu.be/g0T6h32OSC8
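As the abstract describes, the diffusion policy denoises an action sequence while conditioned on the VLM-generated trajectory. A minimal sketch of such a reverse diffusion loop is shown below; the toy linear noise predictor, all shapes, the noise schedule, and names such as `denoise_actions` are illustrative assumptions standing in for the paper's actual network, not its implementation.

```python
import numpy as np

def toy_denoiser(noisy_actions, t, obs_emb, traj_emb, W):
    """Toy linear noise predictor: concatenated inputs -> predicted noise.
    Stands in for the policy network; not the paper's architecture."""
    B, H, A = noisy_actions.shape
    x = np.concatenate([noisy_actions.reshape(B, -1),
                        np.full((B, 1), float(t)),   # diffusion timestep
                        obs_emb, traj_emb], axis=1)  # obs + trajectory conditioning
    return (x @ W).reshape(B, H, A)

def denoise_actions(obs_emb, traj_emb, horizon=16, action_dim=7, steps=10, seed=0):
    """DDPM-style reverse loop: start from Gaussian noise and iteratively
    denoise an action sequence under observation + trajectory conditioning."""
    rng = np.random.default_rng(seed)
    B = obs_emb.shape[0]
    in_dim = horizon * action_dim + 1 + obs_emb.shape[1] + traj_emb.shape[1]
    W = rng.normal(0.0, 0.01, size=(in_dim, horizon * action_dim))  # toy weights
    actions = rng.standard_normal((B, horizon, action_dim))  # pure noise at t=T
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    for i in reversed(range(steps)):
        eps = toy_denoiser(actions, i, obs_emb, traj_emb, W)
        # simplified posterior mean update (variance term omitted)
        actions = (actions - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) \
                  / np.sqrt(alphas[i])
    return actions
```

The key point of the conditioning design is that the trajectory embedding enters the noise predictor at every denoising step, so the sampled action sequence is steered toward the VLM-proposed path throughout generation.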
Problem

Research questions and friction points this paper is trying to address.

Enhances robotic manipulation for long-horizon tasks using VLM-guided diffusion.
Reduces performance degradation in noisy or altered environments significantly.
Decomposes complex tasks into manageable sub-tasks with voxel-based trajectories.
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-guided task decomposition for sub-tasks
Voxel-based trajectory generation for conditioning
Trajectory-conditioned diffusion policy training
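The three contributions above can be made concrete with hypothetical data shapes: a VLM plan as a list of sub-tasks, each carrying coarse voxel-grid waypoints, flattened into a fixed-length conditioning vector for the policy. The plan contents, the 32³ grid size, and the `trajectory_condition` helper are illustrative assumptions, not the paper's actual format.

```python
import numpy as np

# Example of what a VLM decomposition for "put the cube in the drawer"
# might look like: sub-tasks with waypoints as indices into a 32^3
# voxelization of the workspace (contents are hypothetical).
vlm_plan = [
    {"subtask": "open the drawer",      "waypoints": [(16, 4, 10), (16, 8, 10)]},
    {"subtask": "pick up the cube",     "waypoints": [(8, 20, 6), (8, 20, 2)]},
    {"subtask": "place cube in drawer", "waypoints": [(16, 8, 6), (16, 8, 4)]},
]

def trajectory_condition(subtask, grid_size=32, max_waypoints=8, xyz_dim=3):
    """Normalize voxel indices to [0, 1] and pad to a fixed-length vector
    suitable as a conditioning input for the diffusion policy."""
    wp = np.array(subtask["waypoints"], dtype=float) / (grid_size - 1)
    padded = np.zeros((max_waypoints, xyz_dim))
    padded[: len(wp)] = wp
    return padded.flatten()  # shape: (max_waypoints * xyz_dim,)

cond = trajectory_condition(vlm_plan[0])
```

Conditioning on one sub-task's trajectory at a time is what lets a short-horizon policy be reused across the whole long-horizon task: the VLM replans the sub-task sequence, while the policy only ever executes the current segment.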
Kefeng Huang
Tencent Robotics X, China
Tingguang Li
Tencent Robotics X
Reinforcement Learning, Robotics
Yuzhen Liu
Tencent Robotics X, China
Zhe Zhang
Southern University of Science and Technology, Shenzhen, China
Jiankun Wang
Southern University of Science and Technology
Robotics, Path Planning, Motion Control, Human-Robot Interaction
Lei Han
Tencent Robotics X, China