AI Summary
To address the excessive GPU memory and computational overhead of fine-tuning large language models (LLMs), this paper proposes Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP). The method normalizes the weight matrix to improve gradient conditioning and combines low-rank approximation of the weight and gradient matrices with 8-bit quantization in a frozen-parameter design. Crucially, GradNormLoRP achieves full fine-tuning-level performance without compromising convergence or inference accuracy. Empirically, it delivers substantial memory savings: the 8-bit variant reduces optimizer memory consumption by up to 89.5%, enables pre-training of LLaMA 7B on a single NVIDIA RTX 4090 GPU, and attains an average GLUE score of 80.65 with RoBERTa at rank 8, outperforming LoRA by 1.42 points.
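Read as a sketch, and assuming the normalization step follows the standard weight-normalization reparameterization (an editorial illustration, not notation taken from the paper), each output row of the weight matrix is decoupled into a learnable scale and a direction:

$$
W_{i,:} \;=\; g_i \, \frac{V_{i,:}}{\lVert V_{i,:} \rVert_2},
$$

so updates to the direction $V_{i,:}$ are rescaled by $g_i / \lVert V_{i,:} \rVert_2$, which is one way such a reparameterization can improve gradient conditioning.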
Abstract
Large Language Models (LLMs) have shown remarkable performance across various tasks, but their escalating computational demands pose significant challenges, particularly for the extensive use of full fine-tuning on downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform full fine-tuning and still struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining performance comparable to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks: when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA's 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning. Source code and Appendix: https://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Training
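As a rough illustration of the two ingredients the abstract names (weight normalization for gradient conditioning, and a low-rank projection of the gradient before the optimizer step), the sketch below is a minimal PyTorch toy. The names (`normalized_weight`, `low_rank_project`), shapes, and the plain SGD update are illustrative assumptions, not the authors' implementation, and the 8-bit quantization of optimizer states is omitted.

```python
import torch

# Minimal editorial sketch of the two ideas named in the abstract, assuming a
# plain PyTorch toy setup; the function names, shapes, and SGD update are
# illustrative and NOT the authors' implementation.

torch.manual_seed(0)
d_out, d_in, rank, lr = 64, 64, 8, 1e-2

# (1) Weight normalization: parameterize W as g * V / ||V|| per output row so
#     that scale and direction are optimized separately.
V = torch.randn(d_out, d_in, requires_grad=True)
g = torch.ones(d_out, 1, requires_grad=True)

def normalized_weight(V, g):
    return g * V / V.norm(dim=1, keepdim=True)

# (2) Low-rank gradient projection: keep a rank-r (r x d_in) representation of
#     the gradient instead of the full d_out x d_in matrix.
def low_rank_project(grad, r):
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :r]               # d_out x r projection basis
    return P, P.T @ grad       # compact r x d_in gradient

x = torch.randn(16, d_in)
target = torch.randn(16, d_out)

for step in range(3):
    W = normalized_weight(V, g)
    loss = ((x @ W.T) - target).pow(2).mean()
    loss.backward()

    with torch.no_grad():
        P, grad_r = low_rank_project(V.grad, rank)
        V -= lr * (P @ grad_r)   # project the low-rank update back to full size
        g -= lr * g.grad
        V.grad, g.grad = None, None

    print(f"step {step}: loss = {loss.item():.4f}")
```

In the method proper, optimizer states such as Adam moments would live in the small projected space and, in the 8-bit variant, be quantized, which is where the reported optimizer-memory savings come from.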