ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

📅 2025-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address severe GPU memory constraints in fine-tuning large language models (LLMs), this paper proposes ZO2, a fine-tuning framework that combines zeroth-order (ZO) optimization with CPU-GPU offloading. Its core idea is to couple dynamic parameter offloading tightly with ZO's dual-forward gradient estimation, so that parameters are moved between CPU and GPU only when needed. ZO2 also supports a low-bit precision scheme in automatic mixed-precision (AMP) mode, which reduces CPU-GPU data transfer overhead without loss of accuracy. Experiments demonstrate that ZO2 enables full fine-tuning of the OPT-175B model on a single 18GB GPU, reducing peak GPU memory consumption by over 90% compared to baseline ZO methods, while maintaining accuracy and training throughput comparable to standard ZO. This work provides a practical pathway for adapting very large models under stringent resource constraints.

📝 Abstract

Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties because, as the model size grows, so does the memory needed to store activations and gradients during the forward and backward phases. Alternatively, zeroth-order (ZO) techniques can estimate gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it is feasible to extend both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as OPT-175B with more than 175 billion parameters, on a mere 18GB GPU, an achievement beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and no accuracy loss compared to standard zeroth-order methods. ZO2's code is open-sourced at https://github.com/liangyuwang/zo2.
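The abstract's claim that ZO techniques need only forward operations can be illustrated with a two-point (SPSA-style) estimator. The following is a minimal sketch in plain Python, not the ZO2 implementation: `zo_sgd_step`, `quadratic_loss`, the toy parameter list, and the choice to regenerate the perturbation from a seed are all assumptions made for illustration.

```python
import random


def zo_sgd_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One two-point zeroth-order SGD step on a list of floats (illustrative)."""
    # Draw a Gaussian perturbation z. Using a seed means z could be
    # regenerated later instead of being stored (an assumption of this sketch).
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]

    # Two forward evaluations at theta + eps*z and theta - eps*z.
    # No backward pass is run and no activations are kept.
    loss_plus = loss_fn([p + eps * zi for p, zi in zip(params, z)])
    loss_minus = loss_fn([p - eps * zi for p, zi in zip(params, z)])

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    # SGD update along the sampled direction.
    return [p - lr * projected_grad * zi for p, zi in zip(params, z)]


def quadratic_loss(params):
    """Toy stand-in for a model's loss; minimum at 3.0 in every coordinate."""
    return sum((p - 3.0) ** 2 for p in params)


theta = [0.0, 0.0, 0.0, 0.0]
initial_loss = quadratic_loss(theta)
for step in range(500):
    theta = zo_sgd_step(theta, quadratic_loss, lr=0.01, seed=step)
final_loss = quadratic_loss(theta)
print(initial_loss, final_loss)
```

Each step costs two loss evaluations and one scalar, which is why the memory footprint stays close to that of inference.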

Problem

Research questions and friction points this paper is trying to address.

Fine-tuning large language models requires more GPU memory than a single limited GPU provides.

First-order optimizers must store activations and gradients; zeroth-order techniques avoid this by using only forward passes.

CPU-GPU parameter offloading is needed to extend the memory and compute available to a single GPU.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates gradients with zeroth-order techniques, using only forward passes.

Dynamically shifts parameters between CPU and GPU as they are needed.

Uses a low-bit precision scheme in AMP mode for efficient CPU-GPU data exchange.
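One plausible reading of how parameter offloading pairs with the two forward passes is that each block is uploaded once per step and both perturbed passes run on it while it is resident. The toy simulation below counts uploads under that assumption; it is a sketch, not the ZO2 code, and the function names, scalar "blocks", and multiplicative "forward" are all hypothetical.

```python
def naive_dual_forward(cpu_blocks, z, eps, x):
    """Run the two perturbed passes separately, uploading every block twice."""
    uploads = 0
    outputs = []
    for sign in (+1, -1):
        h = x
        for w, zi in zip(cpu_blocks, z):
            gpu_w = w  # stands in for a CPU-to-GPU upload of one block
            uploads += 1
            h = h * (gpu_w + sign * eps * zi)  # toy "forward" through the block
        outputs.append(h)
    return outputs[0], outputs[1], uploads


def fused_dual_forward(cpu_blocks, z, eps, x):
    """Upload each block once and advance both perturbed passes through it."""
    uploads = 0
    h_plus = h_minus = x
    for w, zi in zip(cpu_blocks, z):
        gpu_w = w  # single simulated upload per block
        uploads += 1
        h_plus = h_plus * (gpu_w + eps * zi)
        h_minus = h_minus * (gpu_w - eps * zi)
    return h_plus, h_minus, uploads


blocks = [0.5, 1.5, -2.0]       # one scalar weight per "block"
perturbation = [1.0, -1.0, 0.5]
naive_plus, naive_minus, naive_uploads = naive_dual_forward(blocks, perturbation, 1e-3, 2.0)
fused_plus, fused_minus, fused_uploads = fused_dual_forward(blocks, perturbation, 1e-3, 2.0)
print(naive_uploads, fused_uploads)
```

Both variants compute the same two outputs; the fused loop simply halves the number of simulated transfers, which is the kind of data-movement saving the summary describes.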
