Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory overhead of large-scale Mixture-of-Experts (MoE) models (e.g., DeepSeek-R1-671B) caused by storing all experts, this paper proposes EASY-EP, a lightweight domain-adaptive pruning framework. The authors first identify and characterize a few-shot expert localization phenomenon in MoE models: given only a few demonstrations, the model consistently activates a sparse, stable subset of experts. Building on this insight, they introduce a dual-mechanism pruning paradigm: (1) output-aware importance scoring, which jointly incorporates gating scores and expert output magnitudes; and (2) expert-level token contribution estimation, based on representation similarity before and after the routed experts. Domain-specific few-shot examples guide expert selection. Retaining only 50% of the experts, EASY-EP matches full-model accuracy while achieving 2.99× the throughput of the full model under the same memory budget, a favorable efficiency–accuracy trade-off.
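To make the first component concrete, here is a minimal PyTorch sketch of output-aware importance scoring: each activated expert is credited with its gating weight times the magnitude of its output, accumulated over the few-shot demonstration tokens. The function name and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def expert_importance(gate_scores: torch.Tensor,
                      expert_outputs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of output-aware importance scoring.

    gate_scores:    (tokens, experts) routing weights; zero for experts
                    not activated on a given token.
    expert_outputs: (tokens, experts, hidden) per-expert outputs.
    """
    # Magnitude of each expert's contribution to each token
    out_norms = expert_outputs.norm(dim=-1)        # (tokens, experts)
    # Jointly weight magnitude by the router's gating score
    per_token = gate_scores * out_norms            # (tokens, experts)
    # Accumulate evidence over the demonstration tokens
    return per_token.sum(dim=0)                    # (experts,)
```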

📝 Abstract
Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and magnitudes of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities before and after the routed experts. Experiments show that our method can achieve performance comparable to the full DeepSeek-R1 and $2.99\times$ throughput under the same memory budget with only half of the experts. Our code is available at https://github.com/RUCAIBox/EASYEP.
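As a rough illustration of the second component, the sketch below estimates each token's contribution from how strongly the routed experts change its representation, via cosine similarity before and after the routed experts. The 1 - similarity weighting and all names are assumptions for illustration; the paper may define the contribution differently.

```python
import torch
import torch.nn.functional as F

def token_contribution(hidden_before: torch.Tensor,
                       hidden_after: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of expert-level token contribution estimation.

    hidden_before: (tokens, hidden) states entering the routed experts.
    hidden_after:  (tokens, hidden) states after adding the routed
                   experts' outputs.
    """
    sim = F.cosine_similarity(hidden_before, hidden_after, dim=-1)
    # Lower similarity -> the experts changed the token more ->
    # that token carries more signal about which experts matter.
    return 1.0 - sim                               # (tokens,)
```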
Problem

Research questions and friction points this paper is trying to address.

Reducing memory overhead in large Mixture-of-Experts models
Identifying domain-specific experts with few-shot demonstrations
Pruning irrelevant experts to improve inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific pruning with few-shot demonstrations
Output-aware expert importance assessment
Expert-level token contribution estimation (a combined sketch follows this list)
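Putting the two components together, a minimal end-to-end sketch might weight each token's output-aware expert scores by that token's contribution, aggregate over the demonstrations, and keep the top half of experts. All names, shapes, and the exact aggregation are hypothetical; only the overall recipe follows the paper's description.

```python
import torch
import torch.nn.functional as F

def select_experts(gate_scores: torch.Tensor,
                   expert_outputs: torch.Tensor,
                   hidden_before: torch.Tensor,
                   hidden_after: torch.Tensor,
                   keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: contribution-weighted expert scoring
    followed by top-k retention under a memory budget."""
    # Output-aware importance: gate weight x output magnitude
    per_token = gate_scores * expert_outputs.norm(dim=-1)   # (tokens, experts)
    # Token contribution: how much the routed experts changed each token
    change = 1.0 - F.cosine_similarity(hidden_before, hidden_after, dim=-1)
    # Contribution-weighted aggregation over demonstration tokens
    agg = (change.unsqueeze(-1) * per_token).sum(dim=0)     # (experts,)
    num_keep = max(1, int(keep_ratio * agg.numel()))
    return torch.topk(agg, num_keep).indices  # experts to retain
```

With keep_ratio=0.5 this mirrors the paper's 50% retention setting: only the returned experts' weights would be kept in memory at inference time.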
Zican Dong
Renmin University of China
NLP · long text modeling · LLM
Han Peng
Gaoling School of Artificial Intelligence, Renmin University of China
Peiyu Liu
University of International Business and Economics
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System · Natural Language Processing · Large Language Model
Dong Wu
YanTron Technology Co. Ltd
Feng Xiao
EBTech Co. Ltd
Zhifeng Wang
Liaoning University
economics