🤖 AI Summary
Quantized neural network training faces a trade-off: the Straight-Through Estimator (STE) yields high-quality but biased gradients, while zeroth-order (ZO) gradient estimates are unbiased but can be expensive to compute. This paper proposes First-Order-Guided Zeroth-Order Gradient Descent (FOGZO), which combines directional information from the STE with ZO gradient estimates to reduce STE bias while requiring less computation than ZO methods. The result is a better trade-off between quality and training time in quantization-aware pre-training. Versus STE at the same number of iterations, FOGZO improves accuracy by 1-8% on DeiT Tiny/Small and by 1-2% on ResNet 18/50, and improves perplexity by 1-22 points on LLaMA models with up to 0.3 billion parameters. At the same loss, it requires 796× less computation than n-SPSA for a 2-layer MLP on MNIST.
📝 Abstract
We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796$\times$ reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at https://github.com/1733116199/fogzo.
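To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch of an STE gradient and an n-SPSA gradient estimate for a toy linear model with quantized weights. It is not the FOGZO algorithm, whose combination rule is not described here; the toy model, the rounding quantizer, and the names `quantize`, `loss`, `ste_grad`, `n_spsa_grad`, `step`, and `eps` are all assumptions made for illustration.

```python
import numpy as np

def quantize(w, step=0.5):
    """Uniform rounding quantizer (a hypothetical choice for this sketch)."""
    return step * np.round(w / step)

def loss(w, X, y, step=0.5):
    """Half mean squared error of a linear model evaluated with quantized weights."""
    r = X @ quantize(w, step) - y
    return 0.5 * np.mean(r ** 2)

def ste_grad(w, X, y, step=0.5):
    """First-order gradient via the STE: treat quantize() as the identity
    in the backward pass. One forward/backward pass, but biased."""
    r = X @ quantize(w, step) - y
    return X.T @ r / len(y)

def n_spsa_grad(w, X, y, n=8, eps=0.5, step=0.5, rng=None):
    """Zeroth-order estimate: average n two-point finite differences along
    random Rademacher directions. Costs 2 * n loss evaluations per call.
    eps is set comparable to the quantization step (an assumption) so that
    perturbations can actually change the rounded weights."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(w)
    for _ in range(n):
        z = rng.choice([-1.0, 1.0], size=w.shape)
        delta = loss(w + eps * z, X, y, step) - loss(w - eps * z, X, y, step)
        g += delta / (2 * eps) * z
    return g / n
```

The sketch shows the cost asymmetry the abstract refers to: `ste_grad` needs a single pass, whereas `n_spsa_grad` needs `2 * n` loss evaluations, and lowering its variance means raising `n`.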