PaCA: Partial Connection Adaptation for Efficient Fine-Tuning

πŸ“… 2025-02-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods reduce parameter count and GPU memory usage but suffer from high training latency and large activation memory due to the serial execution of adapter layers and pretrained layers. This work proposes **Partial Connection Adaptation (PaCA)**, a novel PEFT method that avoids introducing auxiliary adapters; instead, it fine-tunes a randomly selected subset of connections within the pretrained weights. PaCA simultaneously removes the sequential adapter overhead and prunes the activations that must be stored for backpropagation, and it supports LoRA-compatible quantization. To our knowledge, this is the first work to apply sparse connection fine-tuning to PEFT. Evaluated on models up to the 70B scale, PaCA matches LoRA's accuracy on MMLU fine-tuning and Oasst1 instruction tuning while reducing training time by 22%, reducing total memory consumption by 16%, increasing the maximum trainable sequence length by 23%, and improving throughput by 16%.


πŸ“ Abstract
Prior parameter-efficient fine-tuning (PEFT) algorithms reduce the memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time: although the computational cost of the adapter layers is much smaller than that of the pretrained layers, the two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers into the model. PaCA not only enhances training speed by eliminating the time overhead of sequentially processing adapter and pretrained layers, but also reduces activation memory, since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequences and improves throughput by 16% on both the NVIDIA A100 GPU and the Intel Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.
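The core idea in the abstract can be illustrated with a small PyTorch sketch. This is a hypothetical, simplified rendering (the class name `PaCALinear`, the column-subset selection, and the two-path forward are illustrative assumptions, not the authors' implementation): a random subset of weight columns becomes trainable, and because only the matching input features feed the trainable path, autograd saves just that slice of the activations for the backward pass.

```python
import torch
import torch.nn as nn


class PaCALinear(nn.Module):
    """Illustrative sketch of Partial Connection Adaptation (PaCA).

    Rather than adding adapter layers, a random subset of input
    connections (weight columns) is made trainable; the rest of the
    pretrained weight stays frozen. Only x[..., idx] is saved by
    autograd for the trainable path, which is the source of the
    activation-memory saving described in the paper.
    """

    def __init__(self, pretrained: nn.Linear, num_trainable_cols: int):
        super().__init__()
        out_features, in_features = pretrained.weight.shape
        # Randomly choose which input connections to fine-tune.
        idx = torch.randperm(in_features)[:num_trainable_cols]
        self.register_buffer("idx", idx)
        # Frozen copy of the full pretrained weight.
        self.register_buffer("weight_frozen", pretrained.weight.detach().clone())
        # Trainable slice, initialized from the pretrained values so the
        # module reproduces the pretrained output before any updates.
        self.weight_partial = nn.Parameter(self.weight_frozen[:, idx].clone())
        self.bias = pretrained.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path: full weight with the trainable columns zeroed out.
        # (Cloning per call is for clarity, not efficiency.)
        w = self.weight_frozen.clone()
        w[:, self.idx] = 0.0
        out = x @ w.t()
        # Trainable path: only the selected input features participate,
        # so autograd stores x[..., idx] instead of the full activation.
        out = out + x[..., self.idx] @ self.weight_partial.t()
        if self.bias is not None:
            out = out + self.bias
        return out
```

At initialization the two paths sum to the original pretrained layer, so the module is a drop-in replacement whose gradient updates touch only the selected columns; there is no extra adapter layer to run in series with the pretrained one.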
Problem

Research questions and friction points this paper is trying to address.

Adapter and pretrained layers execute sequentially on GPUs, so PEFT's lower compute cost does not translate into shorter training time.
Adapter-based methods must store full input activations for gradient computation, keeping activation memory high.
LoRA can merge adapters into the pretrained weights at inference, but not during training, when frozen weights and updating adapters must remain separate.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes randomly selected partial connections within the pretrained weights, with no auxiliary adapter layers
Eliminates the sequential adapter/pretrained-layer processing overhead and stores only partial activations for backpropagation
Reduces training time by 22% and total memory usage by 16% versus LoRA at comparable accuracy
πŸ”Ž Similar Papers
No similar papers found.