FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Large language model training faces two key challenges: excessive memory overhead from optimizer states and gradient information loss in low-rank update methods (e.g., LoRA, GaLore, BAdam). To address these, this paper proposes a memory-efficient optimization framework that combines full-rank updates with gradient splitting. Its core innovation is a gradient splitting mechanism: gradients are decomposed into a low-dimensional adaptive component, updated via Adam-style optimizers, and a high-dimensional stateless component, updated via SGD or signSGD, thereby jointly ensuring convergence and memory efficiency. The framework further integrates low-rank selection strategies and provides theoretical convergence analysis. Under fixed memory budgets, it improves both pretraining and fine-tuning performance, achieving state-of-the-art results while balancing memory efficiency and accuracy.
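The splitting mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function name `frugal_style_step`, the fixed orthonormal projection `P`, and all hyperparameters are assumptions for the sketch.

```python
import numpy as np

def frugal_style_step(W, grad, P, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, lr_free=1e-3):
    """One illustrative update: Adam on a low-dimensional slice of the
    gradient, stateless signSGD on the full-rank residual.
    W: d x n weights, grad: d x n gradient, P: d x r orthonormal projection,
    m, v: r x n Adam moments (the only persistent optimizer state)."""
    g_low = P.T @ grad          # r x n low-dimensional adaptive component
    g_res = grad - P @ g_low    # residual along the remaining directions

    # Adam-style update on the low-dimensional component (r x n state only)
    m = beta1 * m + (1 - beta1) * g_low
    v = beta2 * v + (1 - beta2) * g_low ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))

    # Stateless signSGD on the residual directions (no extra memory)
    W = W - lr_free * np.sign(g_res)
    return W, m, v
```

The point of the sketch is the state bookkeeping: the moments `m` and `v` live in the r-dimensional subspace, while the residual update touches every direction without storing anything.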

📝 Abstract
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low-rank, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
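The abstract's memory argument can be made concrete with back-of-the-envelope arithmetic: Adam keeps two fp32 moment tensors per parameter, whereas projecting a d x n gradient to rank r leaves moments of size r x n (plus the projection itself, if stored). The helper names below are illustrative, not from the paper:

```python
def adam_state_bytes(num_params, bytes_per_el=4):
    # Adam stores two moment tensors (m and v) per parameter
    return 2 * num_params * bytes_per_el

def lowrank_state_bytes(d, n, r, bytes_per_el=4):
    # Rank-r projection of a d x n gradient leaves two r x n moment
    # tensors, plus the d x r projection matrix if it must be stored
    return 2 * r * n * bytes_per_el + d * r * bytes_per_el
```

For a 1000 x 1000 layer at rank 32, the state drops from 8 MB to under 0.4 MB, which is the kind of budget the stateless residual update is meant to preserve.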
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU memory usage in large language model training
Addressing information loss from low-rank gradient updates
Balancing memory efficiency and performance in optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient splitting for memory efficiency
Combines Adam with state-free methods
Integrates various low-rank update techniques
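The last bullet, integration with different selection techniques, can be illustrated by the two selection styles the paper names. The sketch below is an assumption about their simplest form (GaLore chooses a projection from the gradient's top singular directions; BAdam activates one parameter block at a time), with illustrative function names:

```python
import numpy as np

def galore_style_projection(grad, r):
    # GaLore-style choice: top-r left singular vectors of the gradient
    # serve as the low-dimensional adaptive subspace
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :r]

def badam_style_mask(num_blocks, active_block):
    # BAdam-style choice: a single parameter block receives the
    # stateful (Adam) update; the rest can be handled statelessly
    mask = np.zeros(num_blocks, dtype=bool)
    mask[active_block] = True
    return mask
```

Either rule yields the "low-dimensional part" of the split; the framework is agnostic to which one supplies it.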
Philip Zmushko
Unknown affiliation
deep learning · natural language processing · optimization
Aleksandr Beznosikov
PhD, Basic Research of Artificial Intelligence Lab
Optimization · Machine Learning
Martin Takáč
Mohamed bin Zayed University of Artificial Intelligence
Samuel Horváth
Mohamed bin Zayed University of Artificial Intelligence