🤖 AI Summary
Diffusion models carry a high hardware overhead that hinders practical deployment. Existing quantization methods for them rely predominantly on uniform scalar quantization (USQ), whereas vector quantization (VQ) has proven highly effective for large language models (LLMs). This work introduces, for the first time, codebook-based additive vector quantization (AVQ) for diffusion model compression, proposing a W2/W4 ultra-low-bit weight compression framework tailored for inference acceleration, with a custom efficient inference kernel and adaptation to class-conditional generation. Evaluated on LDM-4/ImageNet, the method establishes a new Pareto frontier at ultra-low bitwidths: at W4A8, sFID improves by 1.92 points over the full-precision model; at W2A8, it achieves the best reported FID, sFID, and ISC. Real-world measurements show substantial FLOPs reductions across hardware. By moving beyond scalar quantization, this work establishes a new paradigm for lightweight diffusion models.
📝 Abstract
Significant investments have been made towards the commodification of diffusion models for the generation of diverse media. Their mass-market adoption is, however, still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically to diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.92 points lower than the full-precision model at W4A8 and the best reported results for FID, sFID and ISC at W2A8. We also demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to savings that rely on small-integer operations, which may lack broad hardware support.
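To make the core idea concrete: in additive vector quantization, each group of related weights is approximated as the *sum* of one codeword drawn from each of several learned codebooks, so only small integer indices need to be stored. The sketch below is purely illustrative and is not the paper's algorithm; it uses random codebooks and a simple greedy residual encoder (real methods learn the codebooks and search assignments jointly), with all names and shapes being assumptions.

```python
import numpy as np

def encode_additive_vq(weights, codebooks):
    """Greedy residual encoding: from each codebook in turn, pick the
    codeword closest (squared L2) to the remaining residual.
    weights: (n_groups, d); codebooks: list of M arrays, each (K, d).
    Returns integer codes of shape (n_groups, M)."""
    residual = weights.copy()
    indices = []
    for C in codebooks:
        # distance from every group's residual to every codeword
        d2 = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        residual = residual - C[idx]   # subtract chosen codeword, continue
        indices.append(idx)
    return np.stack(indices, axis=1)

def decode_additive_vq(indices, codebooks):
    """Reconstruct each weight group as the sum of its selected codewords."""
    return sum(C[indices[:, m]] for m, C in enumerate(codebooks))

# Toy configuration: groups of d=8 weights, M=2 codebooks of K=16 entries
# each, i.e. 2 * log2(16) = 8 bits per group -> 1 bit per weight stored.
rng = np.random.default_rng(0)
d, K, M = 8, 16, 2
W = rng.normal(size=(256, d))
codebooks = [rng.normal(size=(K, d)) for _ in range(M)]
codes = encode_additive_vq(W, codebooks)
W_hat = decode_additive_vq(codes, codebooks)
```

Decoding is just integer indexing plus additions, which hints at why an AVQ inference kernel can save FLOPs on any hardware rather than depending on native low-bit integer arithmetic.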