AI Summary
Diffusion-based policies for robotic manipulation suffer from performance degradation on long-horizon tasks and heightened sensitivity to image noise. To address these limitations, we propose Vision-Language Model-guided Trajectory-Conditioned Diffusion Policies (VLM-TDP). Our method leverages a vision-language model (VLM) to automatically decompose long-horizon tasks into semantically grounded subtasks and to generate voxelized trajectory representations that serve as critical conditional inputs to the diffusion policy, enabling semantic-aware guidance and environment adaptation. Trained solely on demonstration data, VLM-TDP is rigorously evaluated both in simulation and on real-world robotic platforms. Experiments demonstrate a 44% average success-rate improvement in simulation, over 100% gain on long-horizon tasks, a 20% reduction in performance degradation under image noise, and significantly superior real-world performance versus baselines. The core contribution is the first integration of VLM-driven semantic reasoning with trajectory-conditioned diffusion modeling, establishing a new paradigm for robust, long-horizon embodied manipulation.
Abstract
Diffusion policy has demonstrated promising performance in the field of robotic manipulation. However, its effectiveness has been primarily limited to short-horizon tasks, and its performance degrades significantly in the presence of image noise. To address these limitations, we propose a VLM-guided trajectory-conditioned diffusion policy (VLM-TDP) for robust, long-horizon manipulation. Specifically, the proposed method leverages state-of-the-art vision-language models (VLMs) to decompose long-horizon tasks into concise, manageable sub-tasks, while also innovatively generating voxel-based trajectories for each sub-task. The generated trajectories serve as a crucial conditioning factor, effectively steering the diffusion policy and substantially enhancing its performance. The proposed Trajectory-conditioned Diffusion Policy (TDP) is trained on trajectories derived from demonstration data and validated using the trajectories generated by the VLM. Simulation results indicate that our method significantly outperforms classical diffusion policies, achieving an average 44% increase in success rate, over 100% improvement on long-horizon tasks, and a 20% reduction in performance degradation under challenging conditions such as noisy images or altered environments. These findings are further reinforced by our real-world experiments, where the performance gap becomes even more pronounced on long-horizon tasks. Videos are available at https://youtu.be/g0T6h32OSC8
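To make the conditioning idea concrete, the sketch below shows a toy DDPM-style reverse process in which an encoded trajectory is concatenated with the observation embedding and fed to the noise-prediction model at every denoising step. This is only an illustrative sketch of trajectory conditioning in general, not the paper's implementation: all dimensions, the `eps_model` stand-in (a random linear map in place of a trained network), and the variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DDPM noise schedule (illustrative values, not the paper's)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

ACTION_DIM, OBS_DIM, TRAJ_DIM = 7, 16, 12
COND_DIM = OBS_DIM + TRAJ_DIM

# Hypothetical stand-in for a learned noise-prediction network:
# a fixed linear map from [noisy_action | conditioning] to predicted noise.
W = rng.normal(scale=0.01, size=(ACTION_DIM + COND_DIM, ACTION_DIM))

def eps_model(noisy_action, cond):
    """Predict the injected noise given the conditioning vector."""
    x = np.concatenate([noisy_action, cond])
    return x @ W  # a real policy would use a conditioned U-Net / transformer

def sample_action(obs_emb, traj_emb):
    """Reverse diffusion: denoise a random action under trajectory conditioning."""
    cond = np.concatenate([obs_emb, traj_emb])  # trajectory steers the policy
    a = rng.normal(size=ACTION_DIM)             # start from pure noise
    for t in reversed(range(T)):
        eps = eps_model(a, cond)
        # Standard DDPM posterior-mean update for step t -> t-1
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.normal(size=ACTION_DIM)
    return a

obs_emb = rng.normal(size=OBS_DIM)    # e.g. an encoded camera observation
traj_emb = rng.normal(size=TRAJ_DIM)  # e.g. an encoded voxelized sub-task trajectory
action = sample_action(obs_emb, traj_emb)
print(action.shape)  # (7,)
```

In this framing, swapping `traj_emb` for a different sub-task's trajectory changes the conditioning vector and hence the denoised action, which is the mechanism by which VLM-generated trajectories can steer the policy across sub-tasks.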