🤖 AI Summary
This work addresses the cost of deploying large video diffusion models, which need many denoising steps per sample and a large resident parameter footprint. The authors propose a deployment-oriented co-compression pipeline for Wan2.2-T2V-A14B that combines few-step distribution-matching distillation with low-bit quantization. Following the model's dual-expert denoising route, the method calibrates the high-noise and low-noise branches separately and protects sensitive entrance layers. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, which reduces activation-distribution mismatch at inference, while a HiF4-style low-bit representation improves dynamic-range coverage. The quantized model stays close to the full-precision model at the same step count and, on average, surpasses the original full-precision baseline at 8 and 20 steps; among the tested configurations, the 20-step setting gives the best quality-efficiency trade-off.
📝 Abstract
Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.
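The branch-wise calibration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the routing boundary value, the layer names, and the helper functions are hypothetical, and plain symmetric 4-bit integer quantization stands in for the HiF4-style format, whose details are not given here. The sketch only shows the idea of collecting activation statistics along the few-step student's timesteps, keeping separate scales for the high-noise and low-noise experts, and leaving protected entrance layers unquantized.

```python
from collections import defaultdict

# Assumed switching point between the two experts on a normalized timestep axis.
# The real boundary used by the model is not stated in this text.
ASSUMED_BOUNDARY = 0.875


def route_expert(t, boundary=ASSUMED_BOUNDARY):
    """Return which expert handles normalized timestep t in [0, 1]."""
    return "high_noise" if t >= boundary else "low_noise"


def calibrate_scales(samples, protected, n_bits=4):
    """Compute one abs-max scale per (expert, layer).

    samples:   iterable of (timestep, layer_name, activations) collected while
               running the distilled few-step student (hypothetical data here).
    protected: set of layer names that stay in full precision.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 7 for signed 4-bit
    absmax = defaultdict(lambda: defaultdict(float))
    for t, layer, acts in samples:
        if layer in protected:
            continue  # sensitive entrance layers are skipped
        expert = route_expert(t)
        peak = max(abs(a) for a in acts)
        absmax[expert][layer] = max(absmax[expert][layer], peak)
    return {
        expert: {layer: (m / qmax if m > 0 else 1.0) for layer, m in layers.items()}
        for expert, layers in absmax.items()
    }


def fake_quantize(values, scale, n_bits=4):
    """Round to the signed n-bit grid, clamp, and map back to real values."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


if __name__ == "__main__":
    # Toy activations at two student timesteps; names and values are made up.
    samples = [
        (1.0, "blocks.0.ffn", [0.5, -7.0]),
        (0.5, "blocks.0.ffn", [0.1, -0.7]),
        (0.5, "patch_embedding", [3.0]),
    ]
    scales = calibrate_scales(samples, protected={"patch_embedding"})
    print(scales)
    print(fake_quantize([0.26, -0.7, 5.0], scales["low_noise"]["blocks.0.ffn"]))
```

In this toy setup the high-noise branch sees activations roughly ten times larger than the low-noise branch, so a single shared scale would waste most of the 4-bit grid on one of them. Keeping per-branch scales is one simple way to reflect the separate calibration the abstract describes.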