🤖 AI Summary
Existing diffusion-based policies suffer from poor real-time performance due to multi-step denoising, failing to meet the low-latency requirements of dexterous 3D robotic manipulation.
Method: This work pioneers the integration of consistency models into robot action generation, proposing an action-space conditional consistency diffusion framework and a consistency distillation scheme that enables precise single-step action synthesis on a low-dimensional action manifold. The approach leverages point-cloud-driven consistency ODE modeling, single-step forward generation, and multi-task simulation training across Adroit and Meta-World benchmarks.
Contribution/Results: Evaluated on 31 manipulation tasks, the model achieves a 10× speedup in average inference speed over the prior state-of-the-art diffusion method while maintaining a competitive average success rate. Notably, it is the first diffusion-inspired policy successfully deployed online on real robots under strict latency constraints.
📝 Abstract
Diffusion models have proven effective at generating complex distributions, from natural images to motion trajectories. Recent diffusion-based methods show impressive performance in 3D robotic manipulation tasks, but they suffer from severe runtime inefficiency due to multiple denoising steps, especially with high-dimensional observations. To this end, we propose a real-time robotic manipulation model named ManiCM that imposes a consistency constraint on the diffusion process, so that the model can generate robot actions in a single inference step. Specifically, we formulate a consistent diffusion process in the robot action space conditioned on the point cloud input, where the original action is required to be directly denoised from any point along the ODE trajectory. To model this process, we design a consistency distillation technique that predicts the action sample directly, instead of predicting the noise as is common in the vision community, for fast convergence in the low-dimensional action manifold. We evaluate ManiCM on 31 robotic manipulation tasks from Adroit and Metaworld, and the results demonstrate that our approach accelerates the state-of-the-art method by 10 times in average inference speed while maintaining a competitive average success rate.
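As a rough illustration of the single-step, sample-predicting idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the skip/output scaling commonly used for consistency models, which makes the function the identity at the smallest noise level. The names `toy_network`, `consistency_fn`, `one_step_action`, the constants `SIGMA_DATA`, `T_MIN`, `T_MAX`, and the random linear "network" standing in for a point-cloud-conditioned model are all hypothetical.

```python
import math
import random

# Assumed constants for illustration only; the abstract does not give these values.
SIGMA_DATA = 0.5            # assumed scale of the action data
T_MIN, T_MAX = 0.002, 80.0  # assumed smallest and largest noise levels


def toy_network(noisy_action, t, cond, weights):
    """Hypothetical stand-in for a network conditioned on point-cloud features.

    It predicts an action sample directly (not the noise).
    """
    features = list(noisy_action) + list(cond) + [math.log(t)]
    return [math.tanh(sum(w * f for w, f in zip(row, features))) for row in weights]


def consistency_fn(noisy_action, t, cond, weights):
    """Map a noisy action at noise level t to an estimate of the clean action.

    The skip/output scaling enforces the boundary condition
    consistency_fn(a, T_MIN, ...) == a.
    """
    c_skip = SIGMA_DATA**2 / ((t - T_MIN) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - T_MIN) / math.sqrt(SIGMA_DATA**2 + t**2)
    pred = toy_network(noisy_action, t, cond, weights)
    return [c_skip * a + c_out * p for a, p in zip(noisy_action, pred)]


def one_step_action(cond, weights, action_dim, rng):
    """Generate an action with a single function evaluation from pure noise."""
    noise = [rng.gauss(0.0, T_MAX) for _ in range(action_dim)]
    return consistency_fn(noise, T_MAX, cond, weights)


rng = random.Random(0)
action_dim, cond_dim = 4, 8
# Random weights in place of trained parameters.
weights = [[rng.gauss(0.0, 0.1) for _ in range(action_dim + cond_dim + 1)]
           for _ in range(action_dim)]
cond = [rng.gauss(0.0, 1.0) for _ in range(cond_dim)]  # placeholder point-cloud feature
action = one_step_action(cond, weights, action_dim, rng)
```

In this sketch a multi-step diffusion sampler would call the network once per denoising step, whereas `one_step_action` calls it exactly once. A trained model would replace the random `weights`, and the conditioning vector would come from a point-cloud encoder.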