Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

📅 2026-06-05

🤖 AI Summary

This work addresses the stability-plasticity dilemma in continual learning of large language models, where acquiring new tasks often leads to catastrophic forgetting of previously learned knowledge. The authors propose SETA, a task-agnostic continual learning framework built on a sparse mixture-of-experts design. SETA adaptively decomposes model updates into sparse subspaces that form two kinds of experts: unique experts that isolate task-specific patterns, and shared experts that capture features common across tasks. Adaptive elastic anchoring and a routing-aware regularization protect shared knowledge at both the weight and routing levels, and a unified gating network selects the appropriate expert combination at inference, so knowledge is both isolated and reused. On LLaMA-2 7B and Qwen3-4B, SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer.

📝 Abstract

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between task-specific knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization, which jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.
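The abstract does not spell out the layer equations, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes scalar "experts", a shared expert that is always active, top-k routing over the unique experts, and a plain fixed-strength quadratic penalty standing in for the adaptive elastic anchoring. The routing-aware regularization is not modeled. All function names and parameters (`top_k_gate`, `mixture_output`, `anchoring_penalty`, `strength`, `k`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores, k):
    """Keep the k largest gate probabilities, renormalise them, zero the rest."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def mixture_output(x, shared_expert, unique_experts, gate_scores, k=1):
    """Assumed combination rule: shared expert always on, unique experts sparse."""
    weights = top_k_gate(gate_scores, k)
    out = shared_expert(x)
    for weight, expert in zip(weights, unique_experts):
        if weight > 0.0:  # experts that were not selected are skipped entirely
            out += weight * expert(x)
    return out


def anchoring_penalty(weights, anchor, strength):
    """Quadratic pull of shared weights toward their pre-task values.

    A fixed-strength stand-in; the paper's anchoring is described as adaptive.
    """
    return strength * sum((w - a) ** 2 for w, a in zip(weights, anchor))
```

In a training loop under these assumptions, `anchoring_penalty` would be added to the task loss for the shared expert's weights only, so that unique experts remain free to specialise while shared weights drift slowly. For example, `mixture_output(3.0, lambda x: 0.5 * x, [lambda x: x + 1.0, lambda x: 2.0 * x], [0.1, 2.0], k=1)` routes to the second unique expert only and adds its output to the shared expert's.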

Problem

Research questions and friction points this paper is trying to address.

continual learning

catastrophic forgetting

plasticity-stability dilemma

task-agnostic

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Subspace Decomposition

Mixture of Experts

Continual Learning