GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zeroth-order optimization for large language model fine-tuning is hindered by high-variance gradient estimates. This work proposes GRZO, a method that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization. This raises the number of effective gradient directions from one to the batch size without additional forward-pass cost, while keeping memory usage at inference level. The authors prove that GRZO is directionally unbiased with variance that scales inversely with batch size, which gives a tighter non-convex convergence bound than MeZO. On Llama3-8B, GRZO improves average accuracy by 3.0 points over MeZO with 23% lower peak GPU memory. Used as a drop-in replacement for the MeZO core, it also improves sparse, low-rank, and quantized zeroth-order variants by 6.0 points on average.

📝 Abstract

Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. We prove that GRZO is directionally unbiased with variance shrinking in inverse proportion to the batch size, yielding a tighter nonconvex convergence bound than MeZO. In experiments on RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.
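
The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea on a toy linear model, not the authors' implementation. Everything named here is an assumption made for illustration: the function names, the squared-error loss, Gaussian perturbations, the two-sided finite difference, and the choice of mean/std standardization as the "group-relative normalization". How a real LLM would apply per-example perturbations at no extra forward cost is not shown.

```python
import numpy as np


def per_example_loss(thetas, X, y):
    """Squared error of a linear model, with one parameter vector per example.

    thetas has shape (B, d): row i is the (perturbed) parameter vector used
    for example i. Returns a vector of B losses.
    """
    preds = np.einsum("bd,bd->b", X, thetas)
    return 0.5 * (preds - y) ** 2


def grzo_style_direction(theta, X, y, eps=1e-3, seed=0, tiny=1e-8):
    """Hypothetical group-relative zeroth-order update direction.

    Assumption: each example i gets its own perturbation z_i, its loss
    difference is standardized against the rest of the mini-batch, and the
    perturbations are averaged with those standardized weights.
    """
    rng = np.random.default_rng(seed)  # a seed stands in for "pseudo-independent"
    B, d = X.shape
    Z = rng.standard_normal((B, d))    # one perturbation per example

    # Two-sided finite difference along z_i, evaluated on example i only.
    loss_plus = per_example_loss(theta + eps * Z, X, y)
    loss_minus = per_example_loss(theta - eps * Z, X, y)
    delta = (loss_plus - loss_minus) / (2.0 * eps)

    # Assumed group-relative normalization: standardize within the batch.
    weights = (delta - delta.mean()) / (delta.std() + tiny)

    # B distinct directions contribute, instead of one shared direction.
    return (weights[:, None] * Z).mean(axis=0)
```

A descent step under these assumptions would be `theta -= lr * grzo_style_direction(theta, X, y, seed=step)`. Because the weights are standardized, the result carries direction information only; its scale is set by the learning rate rather than by the loss magnitude.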

Problem

Research questions and friction points this paper is trying to address.

Zeroth-order optimization

gradient estimation variance

large language model fine-tuning

memory efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-Order Optimization

Group-Relative Normalization

Memory-Efficient Fine-Tuning

Large Language Models