Gradient Weight-normalized Low-rank Projection for Efficient LLM Training

📅 2024-12-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive GPU memory and computational overhead of fine-tuning large language models (LLMs), this paper proposes Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP). The method normalizes the weight matrix to improve gradient conditioning and combines low-rank approximation of the weight and gradient matrices with 8-bit optimizer quantization, achieving performance comparable to full fine-tuning without added inference cost. Empirically, the 8-bit variant reduces optimizer memory usage by up to 89.5%, enables pre-training of LLaMA 7B on a single consumer-level NVIDIA RTX 4090 GPU, and attains an average GLUE score of 80.65 when fine-tuning RoBERTa at rank 8, outperforming LoRA's 79.23 by 1.42 points.

πŸ“ Abstract
Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full fine-tuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory usage during training. Extensive experiments demonstrate that our 8-bit GradNormLoRP reduces optimizer memory usage by up to 89.5% and enables the pre-training of large LLMs, such as LLaMA 7B, on consumer-level GPUs like the NVIDIA RTX 4090, without additional inference costs. Moreover, GradNormLoRP outperforms existing low-rank methods in fine-tuning tasks. For instance, when fine-tuning the RoBERTa model on all GLUE tasks with a rank of 8, GradNormLoRP achieves an average score of 80.65, surpassing LoRA's score of 79.23. These results underscore GradNormLoRP as a promising alternative for efficient LLM pre-training and fine-tuning. Source code and Appendix: https://github.com/Jhhuangkay/Gradient-Weight-normalized-Low-rank-Projection-for-Efficient-LLM-Training
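As a rough illustration of the two core ideas described above (normalizing the weight matrix to improve gradient conditioning, and projecting the gradient into a low-rank subspace so the optimizer state shrinks), here is a minimal NumPy sketch. The function names, shapes, and SVD-based projection are illustrative assumptions, not the paper's released implementation:

```python
import numpy as np

def weight_normalize(v, g):
    """Reparameterize W as g * V / ||V|| (column-wise weight normalization).

    Decoupling direction from magnitude is what improves gradient
    conditioning; `g` may be a scalar or a per-column gain.
    """
    norms = np.linalg.norm(v, axis=0, keepdims=True)
    return g * v / norms

def low_rank_project(grad, rank):
    """Project a full gradient (m x n) onto its top-`rank` left singular
    subspace, returning the basis P (m x r) and the compact gradient
    P^T grad (r x n) that the optimizer actually tracks."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :rank]
    return p, p.T @ grad

def project_back(p, compact_grad):
    """Lift the compact, optimizer-processed gradient back to full size
    before applying the weight update."""
    return p @ compact_grad
```

Because the optimizer only stores moments for the rank-r projected gradient rather than the full m×n matrix, its memory footprint scales with r instead of min(m, n), which is the source of the reported optimizer-memory savings.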
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Optimization
Efficient Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

GradNormLoRP
MemoryEfficiency
PerformancePreservation