AI Summary
This work addresses inefficiencies in Mixture-of-Experts (MoE) inference caused by imbalanced expert loads, which lead to resource underutilization and reduced throughput. Existing expert replication strategies mitigate the imbalance but often over-replicate, incurring excessive GPU memory overhead. To tackle this, the authors propose CRAFT, a framework built on a fine-grained, cost-aware, layer-wise expert replication mechanism. CRAFT makes replication decisions based on the estimated benefit of replicating each layer's experts under a given memory budget, and it requires no additional training or model changes. By combining load analysis with memory-constrained optimization, CRAFT integrates into existing MoE serving systems. Evaluated on large-scale deployments of models ranging from hundreds of billions to a trillion parameters, the approach improves end-to-end serving throughput by 1.14× on average, and by up to 1.2×, over existing replication techniques.
Abstract
Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
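The idea of spending a fixed replica budget where the estimated benefit is largest can be illustrated with a small sketch. This is not CRAFT's actual algorithm, which the abstract does not specify. It is a simple greedy heuristic under stated assumptions: each layer's expert loads are given as token counts, an expert's tokens split evenly across its replicas, every replica costs one unit of the memory budget, and benefit is measured as the drop in the layer's heaviest per-replica load. The names `plan_replicas`, `max_load`, and `min_gain` are hypothetical.

```python
def max_load(loads, replicas):
    """Heaviest per-replica load in one layer, assuming an expert's
    tokens are split evenly across its replicas (an assumption)."""
    return max(load / reps for load, reps in zip(loads, replicas))


def plan_replicas(layer_loads, budget, min_gain=0.0):
    """Greedy sketch: give each replica slot to the layer where it
    lowers the bottleneck load the most; stop early when no layer
    gains more than min_gain, leaving the rest of the budget unused."""
    # Every expert starts with exactly one copy.
    replicas = [[1] * len(loads) for loads in layer_loads]

    def best_move(layer):
        # Estimate the gain from replicating this layer's hottest expert.
        loads, reps = layer_loads[layer], replicas[layer]
        hot = max(range(len(loads)), key=lambda e: loads[e] / reps[e])
        before = max_load(loads, reps)
        reps[hot] += 1            # tentatively add one replica
        after = max_load(loads, reps)
        reps[hot] -= 1            # undo the tentative change
        return before - after, hot

    for _ in range(budget):
        moves = [(best_move(layer), layer) for layer in range(len(layer_loads))]
        (gain, hot), layer = max(moves, key=lambda m: m[0][0])
        if gain <= min_gain:
            break                 # extra replicas would add memory cost only
        replicas[layer][hot] += 1
    return replicas


if __name__ == "__main__":
    # Layer 0 is skewed toward one expert; layer 1 is already balanced.
    loads = [[100, 10, 10, 10], [30, 30, 30, 30]]
    print(plan_replicas(loads, budget=3))
    print(plan_replicas(loads, budget=3, min_gain=10))
```

In this toy input, all three replicas go to the skewed layer and the balanced layer receives none. Raising `min_gain` to 10 stops after two replicas, because the third would reduce the bottleneck by only about 8 tokens. That early stop mirrors the abstract's observation that many replicas provide only marginal improvement while still consuming memory.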