🤖 AI Summary
To eliminate the manual tuning or grid search required by per-channel post-training quantization (PTQ) of large language models (LLMs), this paper proposes a parameter-free, automated quantization method. The core idea leverages the geometric properties of symmetric scalar quantization to derive optimal channel-wise scaling factors analytically over a fixed, non-scaled codebook, avoiding heuristic design, back-propagation, and reliance on large calibration datasets. The method supports both symmetric and asymmetric quantization and requires only a single forward pass. On mainstream LLMs, including LLaMA and OPT, the approach is competitive with state-of-the-art methods under stringent settings (e.g., W4A4) while keeping memory footprint and computational overhead low, improving the efficiency and practicality of deploying large models on edge devices.
📝 Abstract
Quantization is a widely used compression technique for reducing the memory and computation costs of large pre-trained models. A key challenge in per-channel post-training quantization (PTQ) is selecting appropriate scaling factors to replace weight values with values from a scaled quantization grid. Existing methods typically fix the scale at the outset via heuristic tuning or grid search. In this note, we propose Beacon, a simple and effective algorithm that eliminates the need for such manual tuning. Beacon performs per-channel PTQ directly using a fixed non-scaled alphabet and automatically determines the optimal scaling factors by exploiting the geometry of symmetric scalar quantization. It supports both symmetric and asymmetric quantization with minimal modifications and does not rely on back-propagation or large calibration sets. Despite its simplicity and tuning-free nature, Beacon achieves competitive performance compared to state-of-the-art methods, making it a practical solution for efficient model deployment.
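The geometric fact behind scale selection can be made concrete. For a weight channel `w` and a quantized assignment `q` drawn from a fixed grid, the squared error `||w - s*q||^2` is minimized at the closed-form scale `s* = <w, q> / <q, q>` (the least-squares projection of `w` onto the direction of `q`). The sketch below is only illustrative, not the paper's algorithm: it alternates nearest-neighbour assignment with this closed-form scale update, whereas Beacon is described as deriving the optimal scale analytically in a single pass. The function name, the max-abs initialization, and the iteration count are all assumptions introduced here.

```python
import numpy as np

def quantize_channel(w, alphabet, iters=3):
    """Hypothetical per-channel symmetric PTQ sketch (not Beacon itself):
    alternate nearest-neighbour assignment onto a fixed, non-scaled
    alphabet with the closed-form least-squares scale s* = <w,q>/<q,q>."""
    a = np.asarray(alphabet, dtype=np.float64)
    s = np.max(np.abs(w)) / np.max(np.abs(a))  # common max-abs initial scale
    # Nearest codeword on the scaled grid for each weight.
    q = a[np.argmin(np.abs(w[:, None] - s * a[None, :]), axis=1)]
    for _ in range(iters):
        denom = np.dot(q, q)
        if denom == 0:
            break
        s = np.dot(w, q) / denom               # optimal scale for fixed q
        q = a[np.argmin(np.abs(w[:, None] - s * a[None, :]), axis=1)]
    return s, q

rng = np.random.default_rng(0)
w = rng.standard_normal(256)                   # one weight channel
alphabet = np.arange(-8, 8)                    # signed 4-bit grid (W4)

# Baseline: max-abs scale with no refinement.
s0 = np.max(np.abs(w)) / 8.0
q0 = alphabet[np.argmin(np.abs(w[:, None] - s0 * alphabet[None, :]), axis=1)]
err0 = np.linalg.norm(w - s0 * q0)

s, q = quantize_channel(w, alphabet)
err = np.linalg.norm(w - s * q)                # never worse than err0
```

Each alternation step is non-increasing in the reconstruction error, so the refined scale can only match or beat the max-abs heuristic on this channel; the point of the paper is that the tuning loop can be dispensed with entirely.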