🤖 AI Summary
This work addresses the choice of scaling factors in post-training quantization, which are typically set by simple, data-free heuristics. The authors propose PiSO (Piecewise Scale Optimization), an algorithm that uses calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals, on each of which the objective has a closed-form minimizer. The method is extended to group-wise quantization through heuristics and can be interleaved with error correction. Experiments on Llama and Qwen models across several model sizes and bit-widths show consistent improvements in perplexity and zero-shot accuracy, both standalone and combined with error correction, with larger gains at lower bit-widths.
📝 Abstract
Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.
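The piecewise idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: a symmetric integer grid {-qmax, ..., qmax}, a single output channel with weight vector `w`, calibration inputs `X`, and the layer-output squared error ||Xw - s·Xq||² as the objective. All function and variable names are hypothetical. Under round-to-nearest, the integer codes q change only where |w_i|/s crosses a half-integer, so the scale axis splits into finitely many intervals. On each interval the objective is a quadratic in s, with minimizer s* = (qᵀHw)/(qᵀHq), where H = XᵀX. This naive version recomputes the codes on every interval and makes no attempt at the efficiency the paper claims.

```python
import numpy as np


def rtn(w, s, qmax):
    """Round-to-nearest onto the symmetric integer grid {-qmax, ..., qmax}."""
    return np.clip(np.round(w / s), -qmax, qmax)


def output_error(w, q, s, H):
    """Squared output error ||X w - s X q||^2, written with H = X^T X."""
    r = w - s * q
    return float(r @ H @ r)


def piecewise_scale_search(w, X, bits=3, eps=1e-9):
    """Search one channel's scale interval by interval (assumes w is not all zero)."""
    qmax = 2 ** (bits - 1) - 1
    H = X.T @ X
    mags = np.abs(w[w != 0])
    # The code of w_i changes only where |w_i| / s crosses k + 0.5, k = 0..qmax-1.
    half_steps = np.arange(qmax) + 0.5
    edges = np.unique(mags[:, None] / half_steps[None, :])
    # Above the largest edge every code is zero, so the error is w^T H w.
    best_s, best_err = 2.0 * edges[-1], float(w @ H @ w)
    lows = np.concatenate(([0.0], edges[:-1]))
    for lo, hi in zip(lows, edges):
        # Stay strictly inside the interval so rounding ties never occur.
        lo_in, hi_in = max(lo * (1 + eps), eps * hi), hi * (1 - eps)
        if lo_in >= hi_in:
            continue  # interval too narrow to resolve numerically
        q = rtn(w, 0.5 * (lo + hi), qmax)  # codes are constant on (lo, hi)
        denom = float(q @ H @ q)
        if denom <= 0.0:
            continue  # X q = 0: the scale has no effect on this interval
        # Minimizer of the convex quadratic in s, clamped to the interval.
        s = float(np.clip((q @ H @ w) / denom, lo_in, hi_in))
        err = output_error(w, q, s, H)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=16)          # synthetic weights, for illustration only
    X = rng.normal(size=(64, 16))    # synthetic calibration inputs
    s_opt, err_opt = piecewise_scale_search(w, X, bits=3)
    s_absmax = np.abs(w).max() / 3   # data-free abs-max scale for comparison
    err_absmax = output_error(w, rtn(w, s_absmax, 3), s_absmax, X.T @ X)
    print(f"piecewise: s={s_opt:.4f} err={err_opt:.4f}")
    print(f"abs-max:   s={s_absmax:.4f} err={err_absmax:.4f}")
```

Because the abs-max scale lies inside one of the enumerated intervals, the piecewise search can never do worse than it on this objective, up to the small `eps` margin used to avoid rounding ties.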