Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

📅 2025-07-08
🤖 AI Summary
This work addresses the challenges of multi-task adaptation and merging in parameter-efficient fine-tuning. The authors propose a modular architecture built from sparse adapters, which update only a subset of the base network's weights, enabling task-specific adapter training with minimal parameter updates and plug-and-play multi-task merging without additional downstream fine-tuning. The key contribution is a simple sparse training mechanism that is conceptually simpler than existing methods and empirically outperforms both LoRA and full-parameter fine-tuning in the studied setting. Experiments merge adapters across 20 diverse NLP tasks, scaling beyond what is usually studied in the literature, and show that sparse adapters yield superior in-distribution performance post-merging compared to LoRA and full-model merging baselines. However, cross-task (held-out) generalization remains limited for all methods and warrants further investigation.

📝 Abstract
Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly adapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.
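The abstract describes sparse adapters as training only a subset of weights in the base network. The paper's exact selection and update rule is not given here, so the following is only a minimal sketch of the general idea: a fixed binary mask restricts task-specific updates to a small fraction of positions while the rest of the base stays frozen. All function names and the random mask selection are hypothetical, not the paper's method.

```python
import numpy as np

def make_sparse_adapter(base_weights, density=0.05, seed=0):
    """Pick a subset of weight positions to train. Random selection is a
    placeholder; the paper's actual selection criterion may differ."""
    rng = np.random.default_rng(seed)
    mask = rng.random(base_weights.shape) < density
    delta = np.zeros_like(base_weights)  # task-specific sparse update
    return mask, delta

def apply_gradient(delta, grad, mask, lr=1e-2):
    """Gradient step that touches only the masked positions; all other
    base weights remain frozen."""
    delta -= lr * grad * mask
    return delta

def effective_weights(base_weights, delta):
    """Weights used at inference for this task: frozen base plus the
    sparse task-specific delta."""
    return base_weights + delta
```

Because each adapter is just a sparse delta over a shared base, adapters for different tasks can be stored compactly and swapped in without retraining the base model.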
Problem

Research questions and friction points this paper is trying to address.

Exploring sparse adapters for scalable expert merging
Comparing sparse adapters with LoRA and full fine-tuning
Investigating merging properties across multiple NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse adapters train a subset of base network weights
Proposed method outperforms LoRA and full fine-tuning
Scalable merging of adapters for 20 NLP tasks
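The "plug-and-play" merging the paper studies can be sketched as folding each task's sparse delta back into the shared frozen base. This is only an illustrative sketch under the assumption that merging sums the deltas; the paper's actual merging rule, and how it handles positions where adapters overlap, may differ.

```python
import numpy as np

def merge_sparse_adapters(base_weights, deltas):
    """Merge task experts by adding their sparse deltas to the shared base.
    Positions touched by several adapters simply sum here; other overlap
    resolutions (e.g. averaging) are possible."""
    merged = base_weights.copy()
    for delta in deltas:
        merged += delta
    return merged
```

Since each delta is mostly zero, most base weights are untouched by the merge, which is what makes scaling to many tasks (20 in the paper's experiments) tractable.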