🤖 AI Summary
This work addresses severe device-level load imbalance in large-scale expert parallelism (EP) for Mixture-of-Experts (MoE) models, which causes compute stragglers, token all-to-all communication bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which is unreliable under non-stationary workloads. To this end, we propose UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs); it rebalances experts at every microbatch and every layer on the critical path. UltraEP combines quota-driven planning that reacts to post-gating load with expert-state transfers executed through persistent tile streaming and relay-based fan-out mitigation, minimizing exposed overhead. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49× improvement over running without balancing, and reduces inter-rank load imbalance from 1.30–4.01 to 1.01–1.04. Its scalability and robustness are also validated in production MoE training with 2,560 GPUs.
📝 Abstract
Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns.
We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and every layer on the critical path, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering a 1.49$\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.
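To make the reported imbalance figures concrete, the following minimal sketch computes per-rank token load from post-gating expert assignments and an imbalance ratio. The abstract does not define its imbalance metric; the max-over-mean ratio used here is an assumption, and the names `rank_loads`, `imbalance`, and the toy placement are hypothetical illustrations rather than UltraEP's actual planner.

```python
def rank_loads(token_experts, expert_to_rank, num_ranks):
    """Count how many tokens land on each rank, given each token's chosen expert."""
    loads = [0] * num_ranks
    for expert in token_experts:
        loads[expert_to_rank[expert]] += 1
    return loads


def imbalance(loads):
    """Assumed metric: max rank load divided by mean rank load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Toy post-gating result: expert 0 is hot (6 of 10 tokens).
token_experts = [0] * 6 + [1] * 2 + [2] + [3]
expert_to_rank = {0: 0, 1: 0, 2: 1, 3: 1}  # two experts per rank

before = rank_loads(token_experts, expert_to_rank, num_ranks=2)

# Hypothetical rebalancing step: replicate expert 0 onto rank 1
# and divert 3 of its tokens to the replica.
after = [before[0] - 3, before[1] + 3]

print(before, imbalance(before))
print(after, imbalance(after))
```

In this toy case, a single replica of the hot expert is enough to equalize the two ranks; the system described above makes such decisions per microbatch and per layer from the exact post-gating load rather than from historical statistics.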