TENP: Trapezoidal Expert Neuron Pruning For Mixture-of-Experts

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Mixture-of-Experts (MoE) large language models scale efficiently through sparse activation, but the large static parameter footprint of their experts hinders deployment. Existing compression methods either remove entire experts, which disrupts the routing structure, or rely on unstructured weight pruning, which yields limited practical efficiency. This work proposes TENP, a structured pruning framework that scores expert importance by jointly considering the magnitude of an expert's output and its ability to change the direction of the input vector. Important experts are kept, while less important experts are pruned at the neuron level according to each neuron's projected contribution to the expert output, so that parameters are retained in a trapezoidal pattern from shallow to deep layers. In experiments on the Qwen and DeepSeek models, the DeepSeek model at 40% routing expert sparsity and an average of 63.76% activated expert parameters loses only 1 point of accuracy relative to the full-parameter model and outperforms it by 10% on code generation tasks.
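To make the trapezoidal pattern concrete, here is a minimal sketch of one way a per-layer retention schedule could be laid out. The linear interpolation, the function names `trapezoid_keep_ratios` and `neurons_to_keep`, and the endpoint ratios are all assumptions for illustration. The summary does not state which end of the network keeps more parameters or how the paper derives its per-layer budgets.

```python
def trapezoid_keep_ratios(num_layers, first_ratio, last_ratio):
    """Linearly interpolate a keep ratio from the first layer to the last.

    Hypothetical schedule: the endpoints are free parameters, so the same
    function covers a trapezoid that narrows or widens with depth.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if num_layers == 1:
        return [first_ratio]
    step = (last_ratio - first_ratio) / (num_layers - 1)
    return [first_ratio + step * i for i in range(num_layers)]


def neurons_to_keep(hidden_size, ratios):
    """Convert keep ratios into per-layer neuron counts (at least one each)."""
    return [max(1, round(hidden_size * r)) for r in ratios]


# Illustrative numbers only, not taken from the paper.
ratios = trapezoid_keep_ratios(num_layers=5, first_ratio=0.9, last_ratio=0.5)
kept = neurons_to_keep(1000, ratios)
average_ratio = sum(ratios) / len(ratios)
```

With a linear schedule the average retained fraction is simply the midpoint of the two endpoint ratios, which makes it easy to hit a target overall budget.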

📝 Abstract

Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, retaining model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.
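The two scoring ideas in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's formulas: combining output norm and direction change as a product with `1 - cosine`, and scoring a neuron by the scalar projection of its output term onto the full expert output, are guesses at a reasonable reading. The names `expert_importance`, `neuron_contributions`, and `top_neurons` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def expert_importance(x, y, eps=1e-12):
    """Assumed score: output magnitude times direction change (1 - cosine)."""
    cosine = dot(x, y) / (norm(x) * norm(y) + eps)
    return norm(y) * (1.0 - cosine)


def neuron_contributions(activations, down_rows, eps=1e-12):
    """Scalar projection of each neuron's term h_j * w_j onto the output y.

    Assumes the expert output is y = sum_j h_j * w_j, where h_j is the
    neuron's activation and w_j its row of the down projection.
    """
    dim = len(down_rows[0])
    y = [sum(h * row[k] for h, row in zip(activations, down_rows))
         for k in range(dim)]
    y_norm = norm(y) + eps
    return [h * dot(row, y) / y_norm for h, row in zip(activations, down_rows)]


def top_neurons(scores, keep):
    """Indices of the `keep` highest-scoring neurons, in ascending order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:keep])


# Toy expert with three neurons and a two-dimensional output.
activations = [2.0, 0.1, 1.0]
down_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = neuron_contributions(activations, down_rows)
kept = top_neurons(scores, keep=2)
```

Under this definition the per-neuron scores sum to the norm of the expert output, so each score reads as that neuron's share of the output magnitude. In practice the scores would be averaged over the few calibration samples the abstract mentions before selecting neurons.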

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

model compression

parameter footprint

expert pruning

sparse activation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

structured pruning

trapezoidal sparsity