🤖 AI Summary
Large language model training faces two key challenges: excessive memory overhead from optimizer states and gradient information loss in low-rank update methods (e.g., LoRA, GaLore, BAdam). To address these, this paper proposes a memory-efficient optimization framework that combines full-rank updates with gradient splitting. Its core innovation is a novel gradient splitting mechanism: gradients are decomposed into a low-dimensional adaptive component, updated via Adam-style optimizers, and a high-dimensional stateless component, updated via SGD or signSGD, thereby jointly ensuring convergence and memory efficiency. The framework further integrates a low-rank selection strategy and provides rigorous theoretical convergence analysis. Under fixed memory budgets, it significantly improves both pretraining and fine-tuning performance, achieving state-of-the-art accuracy while remaining memory-efficient.
📝 Abstract
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the *effective rank of the weight updates remains low-rank*, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce `FRUGAL` (**F**ull-**R**ank **U**pdates with **G**r**A**dient sp**L**itting), a new memory-efficient optimization framework. `FRUGAL` leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
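To make the gradient-splitting idea concrete, here is a minimal NumPy sketch of one optimizer step in the spirit of the paper (not the authors' implementation): the gradient is projected onto a low-dimensional subspace that receives an Adam-style update with optimizer state, while the full-rank residual receives a state-free signSGD update. The projection basis `P`, the function name `frugal_style_step`, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def frugal_style_step(W, G, P, adam_state, lr=1e-3, lr_residual=1e-4,
                      betas=(0.9, 0.999), eps=1e-8):
    """One illustrative gradient-splitting step (sketch, not the paper's code).

    W: (m, n) weight matrix; G: (m, n) gradient;
    P: (m, r) orthonormal basis of the low-dimensional subspace
       (e.g. top-r left singular vectors of G, GaLore-style).
    """
    # Split the gradient into a low-dim part and a full-rank residual.
    G_low = P.T @ G               # (r, n) projected gradient
    G_residual = G - P @ G_low    # stateless remainder along other directions

    # Adam-style update on the low-dimensional component only,
    # so optimizer state is O(r*n) instead of O(m*n).
    m_t, v_t, t = adam_state
    t += 1
    m_t = betas[0] * m_t + (1 - betas[0]) * G_low
    v_t = betas[1] * v_t + (1 - betas[1]) * G_low**2
    m_hat = m_t / (1 - betas[0]**t)
    v_hat = v_t / (1 - betas[1]**t)
    W -= lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))

    # State-free signSGD update along the remaining directions.
    W -= lr_residual * np.sign(G_residual)
    return W, (m_t, v_t, t)

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
W, G = rng.normal(size=(m, n)), rng.normal(size=(m, n))
P, _ = np.linalg.qr(rng.normal(size=(m, r)))   # stand-in orthonormal basis
state = (np.zeros((r, n)), np.zeros((r, n)), 0)
W_new, state = frugal_style_step(W.copy(), G, P, state)
```

Because every coordinate of the residual is still updated, the overall weight update stays full-rank even though stateful Adam machinery is confined to the `r`-dimensional subspace.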