Improving the Straight-Through Estimator with Zeroth-Order Information

📅 2025-10-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the fundamental trade-off between biased gradients induced by the Straight-Through Estimator (STE) and the high computational cost of unbiased zeroth-order (ZO) optimization in quantized neural network training, this paper proposes First-Order-Guided Zeroth-Order Gradient Descent (FOG-ZO). FOG-ZO is the first method to synergistically integrate the directional information from STE with unbiased ZO gradient estimates, enabling low-cost correction of STE bias and substantially reducing reliance on expensive ZO queries. By preserving near-first-order optimization efficiency while improving gradient estimation fidelity, FOG-ZO facilitates efficient quantization-aware pretraining. Experiments on DeiT, ResNet, and LLaMA demonstrate that FOG-ZO achieves up to 8% higher accuracy or a 22-point reduction in perplexity over baselines, with computational overhead reduced by 796× compared to n-SPSA.

Technology Category

Application Category

📝 Abstract
We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796$ imes$ reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at https://github.com/1733116199/fogzo.
Problem

Research questions and friction points this paper is trying to address.

Training neural networks with quantized parameters using gradient methods
Reducing bias in Straight-Through Estimator gradients for quantization
Improving tradeoff between accuracy and training time in quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines first-order and zeroth-order gradient descent
Reduces bias in Straight-Through Estimator gradients
Improves accuracy with reduced computational cost
🔎 Similar Papers
No similar papers found.