Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address catastrophic forgetting, high computational overhead, and severe cross-domain interference in multi-domain adaptation of Mixture-of-Experts (MoE) models, this paper proposes DES-MoE, a Dynamic Expert Specialization framework. DES-MoE combines a three-phase progressive-freezing fine-tuning schedule, distillation-regularized dynamic routing, and a continuously updated expert-to-domain correlation map to enable on-demand expert isolation alongside collaborative adaptation. Its core contribution is joint multi-domain optimization within a single MoE architecture, without per-domain models or additional inference parameters. Experiments across six domains, including mathematics, programming, and law, show that DES-MoE matches per-domain ESFT performance with one unified model, reduces forgetting by 89% relative to full fine-tuning as domains scale from two to six, and converges 68% faster, significantly outperforming state-of-the-art sparse adaptation and continual learning methods.
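The three-phase fine-tuning schedule described above can be sketched as a simple step-dependent choice of trainable parameter groups. The equal phase boundaries (at 1/3 and 2/3 of training) and the group names (`router`, `shared`, `experts`) are illustrative assumptions, not the paper's actual configuration:

```python
def trainable_groups(step: int, total_steps: int) -> set:
    """Return which parameter groups remain trainable at a given step.

    Hypothetical three-phase progressive-freezing schedule: phase 1
    adapts everything, phase 2 freezes the shared (non-expert)
    backbone, phase 3 trains only the specialist experts.
    """
    frac = step / total_steps
    if frac < 1 / 3:
        return {"router", "shared", "experts"}   # phase 1: full adaptation
    elif frac < 2 / 3:
        return {"router", "experts"}             # phase 2: shared backbone frozen
    else:
        return {"experts"}                       # phase 3: only specialist experts
```

In a training loop, parameters outside the returned groups would have `requires_grad` set to `False` at each phase transition.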

📝 Abstract
Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.
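Innovation (1) in the abstract, the adaptive router regularized by distillation, can be sketched as a standard knowledge-distillation term that pulls the adapted router's expert distribution toward that of the frozen pre-trained router. The loss form, the mixing weight `alpha`, and the temperature are assumptions for illustration, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F


def router_distillation_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             task_loss: torch.Tensor,
                             alpha: float = 0.5,
                             temperature: float = 2.0) -> torch.Tensor:
    """Blend the task loss with a KL term that keeps the adapted router
    close to the frozen pre-trained router's expert distribution.

    student_logits: routing logits of the model being fine-tuned
    teacher_logits: routing logits of the frozen pre-trained model
    """
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),   # student log-probs
        F.softmax(teacher_logits / t, dim=-1),       # teacher probs
        reduction="batchmean",
    ) * (t * t)  # rescale gradients by T^2, as in standard distillation
    return (1 - alpha) * task_loss + alpha * kd
```

When the two routers agree exactly, the KL term vanishes and only the task loss remains; as they diverge, the distillation term penalizes drift from the pre-trained routing behavior.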
Problem

Research questions and friction points this paper is trying to address.

Adapting Mixture-of-Experts models to multiple domains without catastrophic forgetting
Addressing cross-domain interference and prohibitive computation in multi-domain adaptation
Achieving scalable multi-task adaptation while retaining pre-trained knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation
Real-time expert-domain correlation mapping for gradient isolation
Three-phase adaptive fine-tuning schedule with progressive freezing
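The second innovation above, real-time expert-domain correlation mapping with gradient isolation, could be realized as an exponential moving average of per-domain routing frequencies, with gradients zeroed for experts not associated with the current domain. The EMA mechanism, the affinity threshold, and all names here are assumptions sketched for illustration:

```python
import torch


class ExpertDomainMap:
    """Running expert-to-domain affinity (EMA of routing frequencies).

    Experts whose affinity for a domain exceeds a threshold are treated
    as that domain's specialists; all other experts can have their
    gradients masked so cross-domain interference is reduced.
    """

    def __init__(self, num_experts: int, num_domains: int, momentum: float = 0.9):
        self.affinity = torch.zeros(num_domains, num_experts)
        self.momentum = momentum

    def update(self, domain_id: int, expert_indices: torch.Tensor) -> None:
        # Frequency of each expert among this batch's routing decisions.
        counts = torch.bincount(
            expert_indices.flatten(), minlength=self.affinity.shape[1]
        ).float()
        freq = counts / counts.sum().clamp(min=1)
        self.affinity[domain_id] = (
            self.momentum * self.affinity[domain_id]
            + (1 - self.momentum) * freq
        )

    def specialists(self, domain_id: int, threshold: float = 0.05) -> torch.Tensor:
        return (self.affinity[domain_id] >= threshold).nonzero(as_tuple=True)[0]


def mask_nonspecialist_grads(experts, specialist_ids: torch.Tensor) -> None:
    """Zero gradients of experts not assigned to the current domain."""
    keep = set(specialist_ids.tolist())
    for i, expert in enumerate(experts):
        if i not in keep:
            for p in expert.parameters():
                if p.grad is not None:
                    p.grad.zero_()
```

Calling `update` after each routed batch keeps the map current; calling `mask_nonspecialist_grads` between `backward()` and `optimizer.step()` confines updates to the active domain's specialists.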