Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of low expert utilization and computational inefficiency in pretraining large Mixture-of-Experts (MoE) language models by proposing a layer-adaptive expert pruning algorithm. The method introduces, for the first time, a dynamic sparsification mechanism during pretraining that evaluates expert utilization based on token distribution and adaptively prunes redundant experts on a per-layer basis. It further optimizes computational resource allocation through cross-device expert reassignment. When applied to training a 101B-parameter base model from scratch, the approach achieves a 48.3% improvement in training efficiency and a 33.3% reduction in model parameters, while maintaining strong performance across multiple downstream tasks.

Technology Category

Application Category

📝 Abstract
Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training represents a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the Yuan3.0-1T Base model (originally 1515B parameters) from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.
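The abstract describes pruning underutilized experts per layer from token-routing statistics. A minimal sketch of that idea is below; the function name, the profiling-window input, and the `keep_ratio` threshold are assumptions for illustration, not the paper's actual algorithm.

```python
def prune_experts_per_layer(token_counts, keep_ratio=2 / 3):
    """Hypothetical layer-adaptive pruning from routing statistics.

    token_counts: one list per MoE layer; entry e is the number of tokens
    routed to expert e during a profiling window.
    Returns, per layer, the sorted indices of experts to keep.
    """
    kept = []
    for layer_counts in token_counts:
        n_experts = len(layer_counts)
        # Decide independently per layer how many experts survive.
        n_keep = max(1, round(n_experts * keep_ratio))
        # Rank experts by utilization and drop the least-used ones.
        order = sorted(range(n_experts),
                       key=lambda e: layer_counts[e], reverse=True)
        kept.append(sorted(order[:n_keep]))
    return kept
```

A `keep_ratio` of 2/3 would correspond roughly to the reported 33.3% reduction in (expert) parameters, assuming experts dominate the parameter count.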
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Large Language Models
pre-training
training efficiency
expert pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-Adaptive Expert Pruning
Mixture-of-Experts
Pre-Training Efficiency
Expert Pruning
Large Language Models
🔎 Similar Papers
No similar papers found.
YuanLab.ai
Shawn Wu
Jiangang Luo
Tong Yu
Adobe Research
Darcy Chen
Sean Wang
Southern Methodist University, Cox School of Business, Accounting
Xudong Zhao
Louie Li
Claire Wang
Hunter He
Carol Wang
Allen Wang