Accelerating MoE Model Inference with Expert Sharding

πŸ“… 2025-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low hardware utilization and high communication overhead caused by imbalanced expert routing during encoder-style Mixture-of-Experts (MoE) inference on multi-GPU systems, this paper proposes MoEShard, an inference system built on fine-grained sharding of expert weight matrices. Its core idea is a joint row- and column-wise decomposition of each expert's matrices, which yields perfect expert load balancing without capacity factors and without dropping tokens: every token is fully processed and compute idle time is minimized regardless of routing skew. MoEShard further fuses the decomposed expert computations to reduce kernel launches and pairs them with efficient expert-parallel communication to cut latency. On encoder MoE models, MoEShard reduces time-to-first-token (TTFT) by up to 6.4× compared to DeepSpeed while improving GPU utilization and end-to-end throughput.
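The row- and column-wise decomposition described above is, in essence, Megatron-style tensor parallelism applied per expert: column-shard the expert's first matrix and row-shard its second, so every GPU computes a partial output for every token routed to that expert, and a sum (an all-reduce in practice) recovers the full result. A minimal NumPy sketch, with illustrative function names that are not the paper's API:

```python
import numpy as np

def shard_expert(w1, w2, num_gpus):
    """Column-shard w1 (d_model x d_ff) and row-shard w2 (d_ff x d_model).

    Each GPU holds one (w1 slice, w2 slice) pair, so expert work is split
    evenly across GPUs no matter how many tokens the router sends here.
    """
    w1_shards = np.split(w1, num_gpus, axis=1)
    w2_shards = np.split(w2, num_gpus, axis=0)
    return list(zip(w1_shards, w2_shards))

def sharded_expert_forward(x, shards):
    """Simulate the sharded expert FFN on one host.

    Each shard produces a partial output over its slice of the hidden
    dimension; summing the partials (an all-reduce across GPUs in a real
    deployment) reproduces the dense expert output exactly, because the
    element-wise ReLU commutes with column slicing of w1.
    """
    return sum(np.maximum(x @ w1_s, 0.0) @ w2_s for w1_s, w2_s in shards)
```

Since the activation is applied element-wise between the two matrix multiplies, slicing `w1` by columns leaves each shard's ReLU output identical to the corresponding columns of the dense computation, which is why the partial sums match the unsharded expert bit-for-bit (up to floating-point reduction order).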

πŸ“ Abstract
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU, expert-parallel setting remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to 6.4× in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.
Problem

Research questions and friction points this paper is trying to address.

Low hardware utilization during MoE model inference
Imbalanced token routing and the communication overhead it causes
Throughput loss and GPU idle time in multi-GPU expert-parallel settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor sharding for perfect load balancing
Row- and column-wise expert matrix decomposition
Fused expert computations to minimize kernel launches
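The last bullet, fusing decomposed expert computations, can be pictured as replacing a Python-level loop of per-expert GEMMs (one kernel launch each) with a single batched matrix multiply over a stacked expert dimension. A hedged NumPy sketch of the idea, not the paper's actual kernels; token grouping and padding per expert are assumed to have happened upstream:

```python
import numpy as np

def fused_experts_forward(x_per_expert, w1_stack, w2_stack):
    """Batched forward over all experts' shards held on one GPU.

    x_per_expert: (E, T, d)       tokens grouped (and padded) per expert
    w1_stack:     (E, d, h_shard) this GPU's column shard of every W1
    w2_stack:     (E, h_shard, d) the matching row shard of every W2

    One batched matmul per layer stands in for E separate kernel
    launches; the result is a partial output to be all-reduced across
    GPUs, as in the sharded-expert scheme.
    """
    hidden = np.maximum(np.matmul(x_per_expert, w1_stack), 0.0)
    return np.matmul(hidden, w2_stack)
```

On a GPU the same structure maps to a grouped or batched GEMM, which is where the kernel-launch savings come from: launch cost is paid once per layer rather than once per expert.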
πŸ”Ž Similar Papers
No similar papers found.