🤖 AI Summary
Existing diffusion models for real-time, text-driven human motion generation suffer from excessive sampling steps and slow inference, hindering low-latency applications. To address this, we propose MotionPCM, a phased consistency model-based framework that brings consistency modeling into latent-space human motion generation. Our method employs phased distillation coupled with text-conditioned diffusion prior transfer, enabling high-fidelity motion synthesis in just one to three inference steps. Evaluated on benchmarks including AMASS, MotionPCM achieves state-of-the-art performance, significantly improving motion fidelity and text controllability. Moreover, it accelerates inference by over 40× compared to leading diffusion-based approaches, supporting end-to-end real-time generation at more than 30 FPS. This enables practical deployment in interactive virtual avatars, VR/AR systems, and other latency-critical scenarios.
📝 Abstract
Diffusion models have become a popular choice for human motion synthesis due to their powerful generative capabilities. However, their high computational complexity and the many sampling steps they require pose challenges for real-time applications. Fortunately, the Consistency Model (CM) provides a solution that greatly reduces the number of sampling steps from hundreds to a few, typically fewer than four, significantly accelerating the synthesis of diffusion models. However, its application to text-conditioned human motion synthesis in latent space remains challenging. In this paper, we introduce **MotionPCM**, a phased consistency model-based approach designed to improve the quality and efficiency of real-time motion synthesis in latent space.
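To make the speed-up concrete, the sketch below shows the general idea behind few-step consistency sampling: a learned function maps a noisy latent directly to a clean estimate, so generation needs only one to three network evaluations instead of hundreds of denoising steps. This is an illustrative toy, not the paper's MotionPCM implementation; `consistency_fn`, the noise schedule, and the latent shape are all placeholder assumptions.

```python
import numpy as np

def consistency_fn(x_t, t):
    """Toy stand-in for a trained consistency network f(x_t, t).
    A real model would be a neural network conditioned on text;
    here we simply shrink the input toward zero as a placeholder."""
    return x_t / (1.0 + t)

def few_step_sample(shape, timesteps=(1.0, 0.5, 0.1), seed=0):
    """Multi-step consistency sampling: denoise, re-noise to a lower
    noise level, and denoise again. With this schedule it uses at most
    three evaluations, matching the 1-3 steps discussed above."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * timesteps[0]  # start from pure noise
    x0 = consistency_fn(x, timesteps[0])           # first one-shot estimate
    for t in timesteps[1:]:
        # re-inject noise at level t, then map back to a clean estimate
        x_t = x0 + t * rng.standard_normal(shape)
        x0 = consistency_fn(x_t, t)
    return x0

# Hypothetical 8-frame, 16-dimensional motion latent
latent = few_step_sample((8, 16))
print(latent.shape)  # (8, 16)
```

Dropping the loop (`timesteps=(1.0,)`) gives single-step generation; adding intermediate noise levels trades a little latency for quality, which is the knob a phased/multi-step consistency approach tunes.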