🤖 AI Summary
Existing quantization-aware pretraining methods struggle to balance computational efficiency with information-theoretic optimality, which limits model performance. This work proposes Bell Box Quantization (BBQ), a novel approach built on a domain-separation strategy: it performs information-theoretically optimal quantization in the input domain while mapping representations to efficient integer-like data types in the output domain. This design unifies both objectives for the first time and supports ultra-low-bit quantization from 1 to 4 bits. Experimental results show that BBQ achieves substantial improvements over current methods, reducing perplexity by up to 2, 4, 5, and 18 points at 4-, 3-, 2-, and 1-bit precision, respectively.
📝 Abstract
Quantization-Aware Pre-Training (QAPT) is an effective technique for reducing the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g., integers) that are not information-theoretically optimal (ITO). Conversely, existing ITO data types (e.g., Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain and returns its output in a compute-efficient domain, where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods, reducing perplexity by up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at https://github.com/1733116199/bbq.
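To make the domain-separation idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the quantizer assigns each weight to its nearest NormalFloat-style quantile level (an ITO codebook in the input domain) but emits the integer index of that level (a compute-efficient output domain). The function names, the nearest-level assignment rule, and the per-tensor scale are illustrative assumptions; the actual BBQ mapping and training procedure are described in the paper and repository.

```python
import torch

def normalfloat_levels(bits: int) -> torch.Tensor:
    """Illustrative ITO codebook: quantiles of a standard normal
    distribution (NormalFloat-style), one level per code."""
    n = 2 ** bits
    # Evenly spaced probabilities, avoiding the 0/1 endpoints.
    probs = (torch.arange(n, dtype=torch.float32) + 0.5) / n
    return torch.distributions.Normal(0.0, 1.0).icdf(probs)

def bbq_style_quantize(w: torch.Tensor, bits: int = 4):
    """Hypothetical sketch of domain-separated quantization.

    Input domain: pick the information-theoretically motivated level
    (nearest normal quantile) for each weight.
    Output domain: return the integer *index* of that level (a
    compute-efficient data type) instead of the float level itself.
    """
    levels = normalfloat_levels(bits)                       # ITO codebook
    scale = w.abs().max().clamp(min=1e-8) / levels.abs().max()
    # Nearest-level assignment happens in the float (input) domain.
    idx = (w.unsqueeze(-1) / scale - levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scale                       # integer codes

def dequantize(idx: torch.Tensor, scale: torch.Tensor, bits: int = 4):
    """Reference dequantization, e.g. for checking reconstruction error."""
    return normalfloat_levels(bits)[idx.long()] * scale

# Usage: 4-bit codes for a random weight matrix.
w = torch.randn(256, 256)
codes, scale = bbq_style_quantize(w, bits=4)
print(codes.dtype, (dequantize(codes, scale) - w).abs().mean())
```

The point of the sketch is the separation itself: the codebook can stay information-theoretically optimal while everything stored and operated on downstream is an ordinary integer tensor.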