🤖 AI Summary
To address the prohibitive computational cost of full-parameter fine-tuning for large language models and the performance bottlenecks of existing parameter-efficient fine-tuning (PEFT) methods—particularly their inherent rank limitations—this paper proposes HyperAdapt. HyperAdapt introduces learnable diagonal scaling matrices applied independently to the rows and columns of pretrained weight matrices, enabling high-rank adaptation of an $n \times m$ matrix with only $n + m$ trainable parameters. Theoretically, HyperAdapt achieves a significantly higher upper bound on update rank than Low-Rank Adaptation (LoRA). Empirically, on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks, HyperAdapt matches or approaches the performance of full fine-tuning and state-of-the-art PEFT methods on models up to 14B parameters, while reducing trainable parameters by one to three orders of magnitude—demonstrating both exceptional parameter efficiency and strong representational capacity.
📝 Abstract
Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory- and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt's updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.
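The core mechanism described above—scaling the rows and columns of a frozen weight matrix with two learnable diagonal matrices—can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the shapes, initialization near identity, and variable names are assumptions for the example.

```python
import numpy as np

# Sketch of a row/column diagonal-scaling update (illustrative, not the
# paper's code). A frozen pretrained weight W (n x m) is adapted as
#   W' = diag(r) @ W @ diag(c),
# so only the n + m scalars in r and c are trainable, yet the induced
# update W' - W is generally high-rank.
rng = np.random.default_rng(0)
n, m = 8, 6
W = rng.standard_normal((n, m))            # frozen pretrained weight

# Learnable scales, initialized near 1 so training starts close to W.
r = 1.0 + 0.1 * rng.standard_normal(n)     # row scales (length n)
c = 1.0 + 0.1 * rng.standard_normal(m)     # column scales (length m)

# Broadcasting computes diag(r) @ W @ diag(c) without forming dense diagonals.
W_adapted = (r[:, None] * W) * c[None, :]

delta = W_adapted - W                      # the induced update
print("trainable params:", r.size + c.size)          # n + m = 14
print("rank of update:", np.linalg.matrix_rank(delta))
```

Note the contrast with LoRA: a rank-$k$ LoRA update to the same matrix costs $k(n+m)$ parameters and its rank is capped at $k$, whereas here the elementwise scaling update $\Delta W_{ij} = (r_i c_j - 1)\,W_{ij}$ is typically full-rank despite using only $n+m$ parameters.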