Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of fusing large Transformer models trained from distinct initializations on different tasks, a setting where conventional weight-averaging methods fail without shared pretraining. The authors propose Foldable SuperNet Merge (FS-Merge), a fusion framework that optimizes a SuperNet built from the frozen weights of the original models using a feature reconstruction loss, then folds it back into a single model of the original size. FS-Merge integrates knowledge across models with disparate initializations, diverse task objectives, and varying widths, without requiring full training data or shared pretraining, and applies to both MLP and Transformer layers. Extensive experiments show that FS-Merge consistently outperforms knowledge distillation and other baselines across multi-task, multimodal, and cross-scale fusion settings, achieving state-of-the-art results, particularly in data-scarce regimes.

📝 Abstract
Many recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks to obtain a single multi-task model. Most existing works tackle the simpler setup of merging NNs initialized from a common pre-trained network, where simple heuristics like weight averaging work well. This work targets a more challenging goal: merging large transformers trained on different tasks from distinct initializations. First, we demonstrate that traditional merging methods fail catastrophically in this setup. To overcome this challenge, we propose Foldable SuperNet Merge (FS-Merge), a method that optimizes a SuperNet to fuse the original models using a feature reconstruction loss. FS-Merge is simple, data-efficient, and capable of merging models of varying widths. We test FS-Merge against existing methods, including knowledge distillation, on MLPs and transformers across various settings, sizes, tasks, and modalities. FS-Merge consistently outperforms them, achieving SOTA results, particularly in limited data scenarios.
Problem

Research questions and friction points this paper is trying to address.

Merging transformers with different initializations and tasks
Overcoming catastrophic failure of traditional merging methods
Achieving data-efficient multi-task model integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foldable SuperNet for merging transformers
Feature reconstruction loss over frozen original weights
Folding the trained SuperNet into a single model
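The core mechanic, learning small merge/unmerge maps around the frozen original weights so that the merged features reconstruct the originals' features, then folding the maps into the weights, can be sketched for a single linear layer. This is a minimal illustration, not the paper's implementation: FS-Merge trains the SuperNet with SGD on transformers, whereas here a truncated SVD gives the optimal linear reconstruction in closed form; all dimensions and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 256

# Two independently initialized linear layers, stand-ins for one layer
# of each frozen source model.
W_a = rng.standard_normal((d_out, d_in))
W_b = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))          # small unlabeled probe set

# Concatenated features of the frozen originals: the reconstruction target.
F = np.vstack([W_a @ X, W_b @ X])           # shape (2*d_out, n)

# Learn merge/unmerge maps. The paper optimizes these with a feature
# reconstruction loss; a rank-d_out SVD is a closed-form shortcut that
# minimizes the same squared reconstruction error for a linear layer.
U_svd, S, Vt = np.linalg.svd(F, full_matrices=False)
U = U_svd[:, :d_out]                        # unmerge map, (2*d_out, d_out)
M = U.T                                     # merge map,   (d_out, 2*d_out)

# Fold: absorb the merge map into the frozen weights, leaving one
# standard layer of the original width.
W_merged = M @ np.vstack([W_a, W_b])        # shape (d_out, d_in)

# The folded layer's features, unmerged, approximate both originals.
err = np.linalg.norm(U @ (W_merged @ X) - F) / np.linalg.norm(F)
print(f"relative feature reconstruction error: {err:.3f}")
```

The folding step is what distinguishes this from ensembling or distillation at inference time: after training, `M` is multiplied into the stacked frozen weights, so the merged model has the same architecture and cost as a single source model.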