CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference latency and computational cost, while their feed-forward networks (FFNs), which dominate the parameter count, exhibit highly sparse neuron activations. Method: This paper proposes CMoE (Carved MoE), a framework that carves a Mixture-of-Experts (MoE) model out of an existing dense model instead of training one from scratch. It groups FFN neurons into shared and routed experts based on activation rates, constructs a routing mechanism without training from scratch (incorporating a differentiable routing process and load balancing), and then applies lightweight fine-tuning. Contribution/Results: Using modest data, CMoE converts a 7B dense model into a usable MoE within about five minutes, and lightweight fine-tuning recovers high performance in under an hour. Because only a subset of parameters is activated, inference overhead is reduced. The approach avoids the extensive training data and compute that existing MoE construction methods often require, which makes it more practical for deploying LLMs.

📝 Abstract
Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. Feed-forward networks (FFNs), which dominate LLM parameters, exhibit high activation sparsity in hidden neurons. To exploit this, researchers have proposed using a mixture-of-experts (MoE) architecture, where only a subset of parameters is activated. However, existing approaches often require extensive training data and resources, limiting their practicality. We propose CMoE (Carved MoE), a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation. First, neurons are grouped into shared and routed experts based on activation rates. Next, we construct a routing mechanism without training from scratch, incorporating a differentiable routing process and load balancing. Using modest data, CMoE produces a well-designed, usable MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it achieves high-performance recovery in under an hour. We make our code publicly available at https://github.com/JarvisPei/CMoE.
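
The grouping step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the activation rate of a neuron is taken to be the fraction of calibration tokens for which it ranks among the `top_k` largest absolute activations, the neurons with the highest rates form the shared expert, and the remaining neurons are simply chunked into equally sized routed experts. The chunking is a stand-in, not the authors' grouping procedure, and the names `activation_rates` and `carve_experts` are hypothetical.

```python
import random


def activation_rates(hidden, top_k):
    """Fraction of tokens for which each neuron is among the top_k by |activation|.

    `hidden` is a list of rows, one row of FFN hidden activations per token.
    This definition of "activation rate" is an assumption made for the sketch.
    """
    num_tokens = len(hidden)
    num_neurons = len(hidden[0])
    counts = [0] * num_neurons
    for row in hidden:
        ranked = sorted(range(num_neurons), key=lambda j: abs(row[j]), reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / num_tokens for c in counts]


def carve_experts(rates, num_shared_neurons, num_routed_experts):
    """Split neuron indices into one shared group and several routed groups.

    The most frequently active neurons become the shared (always-on) expert.
    The rest are chunked in rate order, which is only a placeholder for a
    real grouping method.
    """
    order = sorted(range(len(rates)), key=lambda j: rates[j], reverse=True)
    shared = order[:num_shared_neurons]
    rest = order[num_shared_neurons:]
    if len(rest) % num_routed_experts != 0:
        raise ValueError("remaining neurons must divide evenly among routed experts")
    size = len(rest) // num_routed_experts
    routed = [rest[i * size:(i + 1) * size] for i in range(num_routed_experts)]
    return shared, routed


# Synthetic calibration data: 200 tokens, 16 hidden neurons.
random.seed(0)
hidden = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]
for row in hidden:
    row[0] += 10  # neuron 0 fires strongly on every token
    row[1] -= 10  # neuron 1 fires strongly (negative sign) on every token

rates = activation_rates(hidden, top_k=4)
shared, routed = carve_experts(rates, num_shared_neurons=4, num_routed_experts=3)
print("shared:", shared)
print("routed:", routed)
```

In this toy setup the two always-active neurons end up in the shared group, while the remaining twelve neurons are divided among three routed experts of four neurons each.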

Problem

Research questions and friction points this paper is trying to address.

Carving MoE models from dense models efficiently
Minimizing the training data and compute needed for conversion
Reducing LLM inference overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

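Grouping neurons into shared and routed experts based on activation rates
Routing mechanism built without training from scratch, with differentiable routing and load balancing
Lightweight fine-tuning for performance recovery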
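
To make the routing and load-balancing ideas concrete, here is a minimal sketch of a generic technique, not the paper's confirmed mechanism. It assumes top-k expert selection on biased router scores, gate weights taken from a softmax over the unbiased scores (so the gates stay differentiable with respect to the scores), and a per-expert bias that is nudged toward balanced load. The function names `route_token` and `update_bias` and the `step` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(scores, bias, top_k):
    """Select top_k experts by (score + bias) and return their gate weights.

    The bias only influences which experts are selected. Gate weights come
    from the softmax of the raw scores, renormalised over the chosen experts.
    """
    probs = softmax(scores)
    chosen = sorted(
        range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True
    )[:top_k]
    norm = sum(probs[e] for e in chosen)
    return {e: probs[e] / norm for e in chosen}


def update_bias(bias, load, step):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step)
        elif l < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# One token with four routed experts, two of them active.
gates = route_token([2.0, 1.0, 0.5, 0.1], bias=[0.0] * 4, top_k=2)
print("gates:", gates)

# Expert loads observed over a batch; experts 0 and 1 are overloaded.
bias = update_bias([0.0] * 4, load=[10, 6, 2, 2], step=0.1)
print("bias:", bias)
```

Without bias, the token is sent to experts 0 and 1. Repeated bias updates push selection away from overloaded experts, while the gate weights themselves remain a function of the unbiased scores.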