🤖 AI Summary
Video diffusion models suffer from low sampling efficiency and difficulty in simultaneously ensuring temporal coherence and visual fidelity, especially for long sequences and large-scale models. To address this, we propose POSE, the first framework enabling stable one-step generation for large-scale video diffusion models. POSE introduces a three-stage mechanism, consisting of stability warm-up, unified adversarial equilibrium, and conditional adversarial consistency, to model one-step generation trajectories directly in Gaussian noise space. It integrates two-stage knowledge distillation, self-adversarial training, Nash equilibrium optimization, and conditional consistency constraints. On VBench-I2V, POSE improves semantic alignment, temporal consistency, and frame quality by an average of 7.15%, reduces inference latency from 1000 to 10 seconds (a 100× speedup), and achieves generation quality comparable to the original multi-step model.
📝 Abstract
The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models: (i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal consistency, and frame quality, reducing the latency of the pre-trained model by 100×, from 1000 seconds to 10 seconds, while maintaining competitive performance.
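The phased training described above (distillation-only priming, then blending in the adversarial objective) can be sketched as a simple loss-weighting schedule. This is a minimal illustration, not the authors' implementation: the function name, step counts, and the linear ramp are all assumptions introduced here for clarity.

```python
# Illustrative sketch (not the POSE authors' code): a phased
# loss-weighting schedule for one-step distillation.
# Phase names follow the abstract; warmup_steps and the linear
# ramp are made-up assumptions for illustration only.

def pose_phase_weights(step, warmup_steps=1000):
    """Return (distill_weight, adversarial_weight) for a training step.

    Phase (i) stability priming: trajectory distillation only,
    stabilizing the one-step generator before adversarial training.
    Phase (ii) unified adversarial equilibrium: the adversarial loss
    is ramped in, pushing the generator toward a Nash equilibrium.
    """
    if step < warmup_steps:
        return 1.0, 0.0  # warm-up: pure distillation, no adversary
    # After warm-up, linearly ramp the adversarial weight to 1.0
    # over the same number of steps (an arbitrary choice here).
    ramp = min(1.0, (step - warmup_steps) / warmup_steps)
    return 1.0, ramp

# The per-step training loss would then combine the two objectives:
#   loss = w_d * distillation_loss + w_a * adversarial_loss
```

A schedule like this keeps the early updates purely imitative, so the generator has a reasonable single-step mapping before the discriminator starts shaping it.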