Less is MoE: Trimming Experts in Domain-Specialist Language Models

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing methods for compressing Mixture-of-Experts (MoE) models struggle to preserve performance on general-purpose tasks while improving efficiency. This work shows that the core capabilities of MoE models are concentrated in a sparse set of intermediate dimensions of the feed-forward networks (FFNs). Building on this insight, the authors propose Fisher-MoE, a fine-grained pruning method that uses Fisher information to assess the importance of individual intermediate dimensions and removes the least important ones. Fisher importance ranks dimensions more accurately than strategies based on activation magnitudes, routing scores, or weight norms. At a 50% MoE compression ratio, Fisher-MoE reduces weight memory by approximately 45% and increases inference throughput by 21%, while preserving model performance on mathematical reasoning and factual knowledge tasks.
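One plausible reading of why a 50% compression ratio yields a weight-memory reduction slightly below 50% is that only the FFN weights are pruned, while attention, embedding, and router weights stay untouched. The sketch below works through that arithmetic. It is an illustration under this assumption, not the paper's accounting: the function names are hypothetical, and the 0.9 share in the usage note is a made-up input chosen for the example.

```python
def gated_ffn_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one gated FFN (gate, up, and down projections, no biases).

    Each of the three matrices has d_model * d_ff entries, so removing one
    intermediate dimension removes 3 * d_model parameters.
    """
    return 3 * d_model * d_ff


def weight_memory_reduction(compressible_fraction: float, compression_ratio: float) -> float:
    """Overall fraction of weight memory saved.

    compressible_fraction: share of all weights that live in the pruned FFNs
                           (assumption: everything else is left as-is).
    compression_ratio:     fraction of intermediate dimensions removed.
    """
    for value in (compressible_fraction, compression_ratio):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie in [0, 1]")
    return compressible_fraction * compression_ratio


# Halving the intermediate width halves the FFN parameter count.
assert 2 * gated_ffn_params(8, 8) == gated_ffn_params(8, 16)

# Hypothetical input: if 90% of weights sat in the pruned FFNs,
# removing half of their intermediate dimensions would save 45% overall.
saving = weight_memory_reduction(compressible_fraction=0.9, compression_ratio=0.5)
```

Under this accounting, the gap between the compression ratio and the memory saving simply reflects the share of weights that the method never touches.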
📝 Abstract
Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches fail catastrophically when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in a sparse set of FFN intermediate dimensions. To identify these dimensions, we use Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within each FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest that the intermediate dimension is an effective unit of granularity both for compression and for ranking where capability concentrates in MoE models.
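The idea of ranking FFN intermediate dimensions by Fisher importance and then removing the lowest-ranked ones can be sketched on a single toy gated FFN. This is a minimal illustration, not the authors' implementation. It assumes a SwiGLU-style FFN, a squared-error loss, and a mask-based empirical Fisher score (the mean squared gradient of the loss with respect to a virtual mask on each intermediate activation). The abstract does not specify the exact Fisher formulation, and all function names here are hypothetical.

```python
import numpy as np


def silu(z):
    """SiLU activation, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def ffn_hidden(x, w_gate, w_up):
    """Intermediate activations h = silu(x W_gate^T) * (x W_up^T), shape (n, d_ff)."""
    return silu(x @ w_gate.T) * (x @ w_up.T)


def fisher_scores(x, targets, w_gate, w_up, w_down):
    """Mask-based empirical Fisher score for each intermediate dimension.

    With a virtual mask m_j = 1 multiplying h_j, dL/dm_j = h_j * dL/dh_j.
    The score is the mean of (dL/dm_j)^2 over samples, for the assumed
    loss L = 0.5 * ||y - target||^2.
    """
    h = ffn_hidden(x, w_gate, w_up)      # (n, d_ff)
    y = h @ w_down.T                     # (n, d_model)
    grad_y = y - targets                 # dL/dy for the squared-error loss
    grad_h = grad_y @ w_down             # dL/dh, shape (n, d_ff)
    return np.mean((h * grad_h) ** 2, axis=0)


def prune_ffn(w_gate, w_up, w_down, scores, keep):
    """Keep the `keep` highest-scoring dimensions by slicing all three matrices."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return w_gate[idx], w_up[idx], w_down[:, idx], idx


# Toy usage with random weights (d_model=8, d_ff=16, 32 samples).
rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 16, 32
w_gate = rng.normal(size=(d_ff, d_model))
w_up = rng.normal(size=(d_ff, d_model))
w_down = rng.normal(size=(d_model, d_ff))
x = rng.normal(size=(n, d_model))
targets = rng.normal(size=(n, d_model))

scores = fisher_scores(x, targets, w_gate, w_up, w_down)
small_gate, small_up, small_down, kept = prune_ffn(w_gate, w_up, w_down, scores, keep=d_ff // 2)
```

Slicing row `j` of the gate and up projections together with column `j` of the down projection removes dimension `j` entirely, so the pruned FFN is a smaller dense layer rather than a masked one. A real pipeline would accumulate scores over calibration batches using the model's actual loss.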
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
model compression
capability preservation
intermediate dimensions
Fisher importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Fisher importance
model compression
intermediate dimension pruning
conditional computation