Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to convert a dense large language model into a hardware-friendly channel-sparse architecture through continual training. Starting from a Qwen2.5-8B dense backbone, the authors continue training at 32K context and introduce a sparse SwiGLU feedforward network gated by a low-rank predictor, which produces FFN-channel routing logits for each token and layer. A bank-wise top-k rule keeps 16 channels in every 64-channel bank, giving 4× sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module sits on the main language modeling path and is optimized during continual training. The paper reports the architecture, training recipe, benchmark performance, and training lessons. It also identifies a layer-local long-context failure mode on RULER-CWE and proposes a single-layer repair algorithm that substantially improves the affected length range.
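One way to write down the gated layer described in this summary is given here as a hedged sketch. The symbols ($x$, $W_g$, $W_u$, $W_d$, $A$, $B$, $s$, $m$, $\mathcal{B}_b$) are notation assumed for illustration and are not taken from the paper. The summary also does not say whether the selected channels are kept with a hard 0/1 mask or rescaled by the predictor, so a hard mask is assumed.

```latex
% Assumed notation: x in R^d is one token's hidden state, m is the FFN width,
% and r << d is the predictor rank.
\begin{align}
  s &= B A x, \qquad A \in \mathbb{R}^{r \times d},\; B \in \mathbb{R}^{m \times r}
      && \text{(low-rank routing logits)} \\
  \mathcal{B}_b &= \{64b, \dots, 64b + 63\}, \qquad b = 0, \dots, m/64 - 1
      && \text{(64-channel banks)} \\
  m_i &= \mathbb{1}\!\left[\, s_i \text{ is among the 16 largest of } \{ s_j : j \in \mathcal{B}_b \} \,\right],
      \quad i \in \mathcal{B}_b
      && \text{(bank-wise top-}k\text{)} \\
  y &= W_d \left( m \odot \operatorname{SiLU}(W_g x) \odot (W_u x) \right)
      && \text{(sparse SwiGLU)} \\
  \frac{\lVert m \rVert_0}{m_{\text{ffn}}} &= \frac{16}{64} = \frac{1}{4}
      && \text{(4}\times\text{ sparsity)}
\end{align}
% In the last line m_ffn denotes the FFN width (written m above) to avoid
% clashing with the mask vector m.
```

Because exactly 16 channels survive in every bank, the active count is the same for every token and layer, which is what gives the pattern its regular, hardware-oriented structure.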

📝 Abstract

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN during this stage. For each token and layer, a low-rank predictor produces FFN-channel routing logits. We then apply a bank-wise top-k rule that retains 16 channels in every 64-channel bank, yielding 4× sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, so the dense model is upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
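The bank-wise top-k rule in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`bankwise_topk_mask`, `sparse_swiglu`), the weight names, the toy dimensions, and the use of a hard 0/1 mask are all assumptions made for illustration. The sketch also computes the dense activation and then masks it for clarity, whereas a hardware-oriented kernel would presumably skip the masked channels.

```python
import numpy as np

def bankwise_topk_mask(logits, bank_size=64, k=16):
    """Boolean mask keeping the k largest logits inside each contiguous bank."""
    *lead, n = logits.shape
    if n % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    banks = logits.reshape(*lead, n // bank_size, bank_size)
    # After argpartition, the last k positions hold the indices of the k largest values.
    top_idx = np.argpartition(banks, bank_size - k, axis=-1)[..., -k:]
    mask = np.zeros(banks.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return mask.reshape(logits.shape)

def silu(x):
    return x / (1.0 + np.exp(-x))

def sparse_swiglu(x, w_gate, w_up, w_down, p_in, p_out, bank_size=64, k=16):
    """SwiGLU FFN whose intermediate channels are gated by a low-rank predictor.

    Hypothetical shapes: x (tokens, d_model); w_gate, w_up (d_model, d_ff);
    w_down (d_ff, d_model); p_in (d_model, rank); p_out (rank, d_ff).
    """
    logits = (x @ p_in) @ p_out               # low-rank routing logits per channel
    mask = bankwise_topk_mask(logits, bank_size, k)
    hidden = silu(x @ w_gate) * (x @ w_up)    # dense SwiGLU intermediate activation
    return (hidden * mask) @ w_down, mask     # assumed hard mask on the activation

# Toy sizes chosen only for the demo; they are not the paper's dimensions.
rng = np.random.default_rng(0)
tokens, d_model, d_ff, rank = 5, 32, 256, 8
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff)) * 0.1
w_up = rng.standard_normal((d_model, d_ff)) * 0.1
w_down = rng.standard_normal((d_ff, d_model)) * 0.1
p_in = rng.standard_normal((d_model, rank))
p_out = rng.standard_normal((rank, d_ff))

y, mask = sparse_swiglu(x, w_gate, w_up, w_down, p_in, p_out)
per_bank = mask.reshape(tokens, d_ff // 64, 64).sum(axis=-1)
print(y.shape, per_bank[0], mask.mean())
```

Selecting within fixed banks rather than over the whole FFN width keeps the number of active channels identical in every bank, so each token activates exactly one quarter of the intermediate channels.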

Problem

Research questions and friction points this paper is trying to address.

dense-to-sparse

continual training

channel sparsity

large language models

long-context failure

Innovation

Methods, ideas, or system contributions that make the work stand out.

continual training

predictor-gated sparsity

bank-wise sparsity