🤖 AI Summary
Large language model training faces two key challenges: excessive memory overhead from optimizer states and gradient information loss in low-rank update methods (e.g., LoRA, GaLore, BAdam). To address these, this paper proposes a memory-efficient optimization framework that combines full-rank updates with gradient splitting. Its core innovation is a novel gradient splitting mechanism: gradients are decomposed into a low-dimensional adaptive component, updated via Adam-style optimizers, and a high-dimensional stateless component, updated via SGD or signSGD, thereby jointly ensuring convergence and memory efficiency. The framework further integrates a low-rank selection strategy and provides rigorous theoretical convergence analysis. Under fixed memory budgets, it significantly improves both pretraining and fine-tuning performance, achieving state-of-the-art accuracy while remaining memory-efficient.
📝 Abstract
With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the *effective rank of the weight updates remains low-rank*, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce `FRUGAL` (**F**ull-**R**ank **U**pdates with **G**r**A**dient sp**L**itting), a new memory-efficient optimization framework. `FRUGAL` leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
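To make the gradient-splitting idea concrete, here is a minimal NumPy sketch of one optimizer step in the spirit of the paper (not the authors' implementation): the gradient is projected onto a low-dimensional subspace that receives an Adam-style update with optimizer state, while the full-rank residual receives a state-free signSGD update. The projection basis `P`, the function name `frugal_style_step`, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def frugal_style_step(W, G, P, adam_state, lr=1e-3, lr_residual=1e-4,
                      betas=(0.9, 0.999), eps=1e-8):
    """One illustrative gradient-splitting step (sketch, not the paper's code).

    W: (m, n) weight matrix; G: (m, n) gradient;
    P: (m, r) orthonormal basis of the low-dimensional subspace
       (e.g. top-r left singular vectors of G, GaLore-style).
    """
    # Split the gradient into a low-dim part and a full-rank residual.
    G_low = P.T @ G               # (r, n) projected gradient
    G_residual = G - P @ G_low    # stateless remainder along other directions

    # Adam-style update on the low-dimensional component only,
    # so optimizer state is O(r*n) instead of O(m*n).
    m_t, v_t, t = adam_state
    t += 1
    m_t = betas[0] * m_t + (1 - betas[0]) * G_low
    v_t = betas[1] * v_t + (1 - betas[1]) * G_low**2
    m_hat = m_t / (1 - betas[0]**t)
    v_hat = v_t / (1 - betas[1]**t)
    W -= lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))

    # State-free signSGD update along the remaining directions.
    W -= lr_residual * np.sign(G_residual)
    return W, (m_t, v_t, t)

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
W, G = rng.normal(size=(m, n)), rng.normal(size=(m, n))
P, _ = np.linalg.qr(rng.normal(size=(m, r)))   # stand-in orthonormal basis
state = (np.zeros((r, n)), np.zeros((r, n)), 0)
W_new, state = frugal_style_step(W.copy(), G, P, state)
```

Because every coordinate of the residual is still updated, the overall weight update stays full-rank even though stateful Adam machinery is confined to the `r`-dimensional subspace.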