DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models

📅 2024-10-12

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

To address storage redundancy among open-source fine-tuned models and increased response times in multi-model deployments, this paper proposes DAREx, a framework that extends delta-parameter pruning (DPP) to the regimes where DARE fails: high pruning rates and large delta-parameter magnitudes. The paper makes three contributions: (1) DAREx-q modifies the rescaling factor, improving performance at high pruning rates (e.g., by more than 30% on COLA and SST2 for encoder models, with larger gains for decoder models); (2) DAREx-L2 combines DARE with AdamR, an in-training method that regularizes the delta parameters before pruning; (3) it revisits importance-based pruning within DPP, shows that it outperforms random pruning when delta parameters are large, and develops a pipeline for selecting a DPP method under different practical scenarios. DAREx-q can also be combined with parameter-efficient fine-tuning techniques such as LoRA and can facilitate structural DPP.

📝 Abstract

Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters (the differences between fine-tuned and pre-trained model weights) while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To push DARE's limits, we introduce DAREx (DARE the eXtreme), which features two algorithmic improvements: (1) DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates (e.g., >30% on COLA and SST2 for encoder models, with even greater gains in decoder models), and (2) DAREx-L2, which combines DARE with AdamR, an in-training method that applies appropriate delta regularization before DPP. We also demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
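
The random drop-and-rescale step described in the abstract can be sketched as follows. This is a minimal illustration on flat lists of weights, not the authors' implementation: the function names, the `rescale` override, and the `seed` argument are hypothetical. With `rescale=None` the sketch uses the standard DARE factor 1 / (1 - p) for drop rate p; passing a different value stands in for the kind of rescaling-factor modification DAREx-q makes, whose actual selection procedure is not given in this summary.

```python
import random


def dare_rescale_factor(drop_rate):
    """Standard DARE rescaling factor 1 / (1 - p) for drop rate p."""
    return 1.0 / (1.0 - drop_rate)


def dare_prune(pretrained, finetuned, drop_rate, rescale=None, seed=0):
    """Randomly drop delta parameters and rescale the survivors.

    pretrained, finetuned: equal-length lists of floats (flattened weights).
    drop_rate: probability p of zeroing each delta parameter, in [0, 1).
    rescale: factor applied to surviving deltas. None means the DARE
        default 1 / (1 - p); any other value is an illustrative stand-in
        for a modified factor (assumption, not the paper's procedure).
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if rescale is None:
        rescale = dare_rescale_factor(drop_rate)
    rng = random.Random(seed)
    merged = []
    for w_pre, w_ft in zip(pretrained, finetuned):
        delta = w_ft - w_pre                 # delta parameter
        keep = rng.random() >= drop_rate     # keep with probability 1 - p
        merged.append(w_pre + (delta * rescale if keep else 0.0))
    return merged


if __name__ == "__main__":
    pre = [1.0, 2.0, -1.0, 0.5]
    ft = [1.5, 2.25, -1.5, 0.75]
    print(dare_prune(pre, ft, drop_rate=0.5))
    # The default factor grows without bound as the drop rate approaches 1.
    for p in (0.5, 0.9, 0.99):
        print(p, dare_rescale_factor(p))
```

The loop at the end illustrates the first failure mode the abstract names: the default factor is about 10 at a 90% drop rate and about 100 at 99%, so each surviving delta is amplified heavily, and large deltas make the resulting perturbation worse.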

Problem

Research questions and friction points this paper is trying to address.

Reduces redundancy in storing fine-tuned models

Extends delta-parameter pruning to high pruning rates, where DARE's rescaling factor becomes excessively large

Handles delta parameters with high mean and variance, a second regime in which DARE fails

Innovation

Methods, ideas, or system contributions that make the work stand out.

DAREx-q modifies the rescaling factor to improve performance at high pruning rates

DAREx-L2 combines DARE with AdamR regularization

DAREx-q can be combined with LoRA and can facilitate structural DPP

🔎 Similar Papers

Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models