🤖 AI Summary
Video diffusion models (VDMs) suffer from prohibitive computational and memory overhead, hindering practical deployment, and existing quantization methods yield suboptimal performance on video generation. This work proposes QVGen, the first quantization-aware training (QAT) framework to support ultra-low-bit quantization (4 bits or below) with zero inference overhead and high-fidelity video synthesis. The method introduces: (1) a theoretical analysis showing that reducing the gradient norm is key to QAT convergence, addressed with auxiliary modules Φ that absorb large quantization errors; and (2) a rank-decay strategy that progressively eliminates Φ via repeated singular value decomposition (SVD) and a rank-based regularization γ, pruning low-contributing components so that no overhead remains at inference. Evaluated on four state-of-the-art VDMs (1.3B to 14B parameters), the approach is the first to achieve full-precision-comparable quality under 4-bit quantization. Notably, 3-bit CogVideoX-2B gains +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench, substantially advancing the frontier of efficient video generation.
📝 Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis. Yet their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has achieved notable success in reducing cost for image DMs, but its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3$B$\sim$$14$B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.
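The core idea of the rank-decay strategy can be sketched in a few lines: repeatedly SVD-factorize an auxiliary module's weight and zero out its lowest-contributing singular components until the module vanishes entirely. The sketch below is illustrative only, assuming NumPy; the `decay_rank` helper, the rank schedule, and the stand-in weight `W_phi` are hypothetical and not the paper's actual implementation (which interleaves QAT updates and uses the regularization $\gamma$ to pick which components to decay).

```python
import numpy as np

def decay_rank(W: np.ndarray, keep_rank: int) -> np.ndarray:
    """Keep only the top-`keep_rank` singular components of W.

    Illustrative stand-in for the paper's SVD-based decay step: components
    with small singular values (low contribution) are zeroed out.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[keep_rank:] = 0.0  # discard low-contributing singular values
    return (U * S) @ Vt  # reconstruct the rank-truncated weight

# Progressively shrink a stand-in auxiliary weight toward zero:
rng = np.random.default_rng(0)
W_phi = rng.normal(size=(64, 64))      # hypothetical auxiliary module weight
for r in [32, 16, 8, 4, 0]:            # hypothetical decaying rank schedule
    W_phi = decay_rank(W_phi, r)
    # ... QAT updates of the model and W_phi would happen between steps ...

assert np.allclose(W_phi, 0.0)  # rank 0: the auxiliary branch is eliminated
```

Once the schedule reaches rank zero, the auxiliary branch contributes nothing and can be dropped from the graph, which is how the framework ends up with zero inference overhead despite training with $\Phi$.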