UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe device-level load imbalance that arises in large-scale Expert Parallelism (EP) for Mixture-of-Experts (MoE) models, which causes compute stragglers, token all-to-all communication bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which is unreliable under non-stationary workloads. The authors present UltraEP, described as the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs); it rebalances experts at every microbatch and every layer on the critical path. UltraEP combines quota-driven planning that reacts to post-gating load with expert-state transfers executed through persistent tile streaming and relay-based fan-out mitigation, which keeps the exposed overhead low. Averaged across MoE models from 106B to 671B parameters, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49× improvement over non-balanced execution, and reduces inter-rank load imbalance from 1.30–4.01 to 1.01–1.04. Its scalability and robustness are additionally validated in production MoE training with 2,560 GPUs.

📝 Abstract

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49× improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30–4.01 to 1.01–1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.
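
To make the imbalance figures above concrete, the following is a minimal illustrative sketch, not UltraEP's actual planner. It assumes that inter-rank imbalance is measured as the maximum per-rank token load divided by the mean load (the abstract does not state the exact definition), and it uses a toy greedy quota scheme in which every rank is capped at the ceiling of the mean load. The function names `rank_loads`, `imbalance`, `quota_plan`, and `apply_moves` are hypothetical, and the sketch ignores which expert each token needs, so it does not model the expert replication that a real balancer must perform.

```python
def rank_loads(expert_ids, experts_per_rank, num_ranks):
    """Count post-gating token slots per rank (experts are laid out contiguously)."""
    loads = [0] * num_ranks
    for e in expert_ids:
        loads[e // experts_per_rank] += 1
    return loads


def imbalance(loads):
    """Assumed metric: max load over mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def quota_plan(loads):
    """Toy quota planner (an assumption, not the paper's algorithm).

    Each rank gets a quota of ceil(mean load). Ranks above quota hand their
    surplus to ranks below quota. Returns (src_rank, dst_rank, tokens) moves.
    """
    quota = -(-sum(loads) // len(loads))  # integer ceiling of the mean
    surplus = [[r, l - quota] for r, l in enumerate(loads) if l > quota]
    deficit = [[r, quota - l] for r, l in enumerate(loads) if l < quota]
    moves = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        n = min(surplus[i][1], deficit[j][1])
        moves.append((surplus[i][0], deficit[j][0], n))
        surplus[i][1] -= n
        deficit[j][1] -= n
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return moves


def apply_moves(loads, moves):
    """Return the per-rank loads after executing the planned moves."""
    out = list(loads)
    for src, dst, n in moves:
        out[src] -= n
        out[dst] += n
    return out


# Hypothetical microbatch: 12 token slots, 8 experts, 4 ranks (2 experts each).
expert_ids = [0, 0, 1, 0, 1, 2, 4, 7, 0, 1, 3, 0]
loads = rank_loads(expert_ids, experts_per_rank=2, num_ranks=4)
moves = quota_plan(loads)
balanced = apply_moves(loads, moves)
print(loads, round(imbalance(loads), 2))
print(moves)
print(balanced, imbalance(balanced))
```

In this toy case rank 0 starts with 8 of the 12 token slots, so the assumed imbalance metric is 8/3, and the quota plan brings every rank to 3. A real system additionally has to place expert replicas on the receiving ranks before tokens can be redirected, which is the expert-state transfer cost the abstract refers to.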

Problem

Research questions and friction points this paper is trying to address.

expert parallelism

load imbalance

MoE models

rack-scale nodes

non-stationary load

Innovation

Methods, ideas, or system contributions that make the work stand out.

UltraEP

Mixture-of-Experts (MoE)

Expert Parallelism