MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
MoE models suffer from load imbalance and skewed inter-device communication in multi-GPU inference, resulting in high tail latency and low throughput. To address this, we propose a globally optimized expert assignment method that, for the first time, formulates inter-layer token routing dependencies as integer linear programming (ILP) constraints. The approach jointly optimizes computational load balancing and communication overhead, addressing the dual imbalance in load distribution and communication cost that plagues conventional expert parallelism. In end-to-end inference evaluations, it achieves 9.3% and 17.5% speedups for single-node and multi-node deployments, respectively, while significantly reducing tail latency and improving throughput.
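The summary hinges on one empirical property: tokens routed to a given expert in one layer tend to be routed to a small set of experts in the next layer, so this dependency can be profiled and fed to the optimizer. Below is a minimal sketch of how such a per-layer-pair dependency matrix could be accumulated from routing traces; the trace format (one top-1 expert id per token per layer) and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def routing_dependency(layer_l_experts, layer_l1_experts, num_experts):
    """Count how often a token routed to expert i in layer l is routed to
    expert j in layer l+1 (hypothetical trace: one top-1 expert id per token)."""
    dep = np.zeros((num_experts, num_experts), dtype=np.int64)
    for e_l, e_l1 in zip(layer_l_experts, layer_l1_experts):
        dep[e_l, e_l1] += 1
    return dep

# Toy trace of 6 tokens over 4 experts:
layer_l  = [0, 0, 1, 2, 0, 3]   # expert chosen per token in layer l
layer_l1 = [1, 1, 1, 3, 1, 3]   # expert chosen for the same tokens in layer l+1
print(routing_dependency(layer_l, layer_l1, num_experts=4))
```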

📝 Abstract
The Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, their large memory footprint requires them to be distributed across GPU devices, exposing critical performance bottlenecks. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, which result in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU data transfers. These factors degrade the performance of MoE models by increasing tail latency and reducing overall throughput. To address these limitations, we propose an Integer Linear Programming (ILP) formulation that optimizes expert placement by jointly considering token load, communication, and computation costs. We exploit the token routing dependency across layers: tokens routed to a specific expert in one layer are likely to be routed to a limited set of experts in the subsequent layer. Our solution, MoETuner, offers an optimal expert-to-GPU assignment that minimizes inter-GPU token routing costs and balances token processing across devices, thereby reducing tail latency and end-to-end execution time. Experimental results demonstrate 9.3% and 17.5% end-to-end speedups for single-node and multi-node inference, respectively, showcasing the potential of our ILP-based optimization for offering expert-parallel solutions for next-generation MoEs.
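To make the abstract's core idea concrete, here is a minimal sketch of an expert-placement ILP in this spirit, written against the open-source PuLP solver. The variable names, the linearization of the colocation term, and the weight `alpha` are assumptions for illustration; the exact objective, constraints, and cost model of MoETuner are not reproduced here.

```python
# pip install pulp
import pulp

def place_experts(load, dep, num_gpus):
    """Assign experts to GPUs so that per-GPU token load is balanced and
    cross-GPU routing (weighted by inter-layer dependency counts dep[i][j])
    stays small. `load[e]` is the token count routed to expert e."""
    E = len(load)
    experts, gpus = range(E), range(num_gpus)

    prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (experts, gpus), cat="Binary")        # expert e on GPU g
    same = pulp.LpVariable.dicts("same", (experts, experts, gpus),
                                 lowBound=0, upBound=1)                  # colocation indicator
    max_load = pulp.LpVariable("max_load", lowBound=0)

    # Each expert is placed on exactly one GPU.
    for e in experts:
        prob += pulp.lpSum(x[e][g] for g in gpus) == 1
    # max_load upper-bounds every GPU's total token load.
    for g in gpus:
        prob += pulp.lpSum(load[e] * x[e][g] for e in experts) <= max_load
    # same[i][j][g] may be 1 only if experts i and j are both on GPU g.
    for i in experts:
        for j in experts:
            for g in gpus:
                prob += same[i][j][g] <= x[i][g]
                prob += same[i][j][g] <= x[j][g]

    # Cross-GPU traffic: dependency weight is paid whenever i and j are split.
    cross = pulp.lpSum(dep[i][j] * (1 - pulp.lpSum(same[i][j][g] for g in gpus))
                       for i in experts for j in experts if dep[i][j] > 0)
    alpha = 1.0  # assumed relative weight of communication vs. load balance
    prob += max_load + alpha * cross

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {e: next(g for g in gpus if x[e][g].varValue > 0.5) for e in experts}
```

The `same[i][j][g]` variables stand in for the product x[i][g]·x[j][g]: because the objective rewards colocation of dependent experts, the solver pushes them up to that product, so no quadratic term is needed.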
Problem

Research questions and friction points this paper is trying to address.

Optimize expert placement in MoE models
Balance token routing across GPU devices
Reduce communication and processing inefficiencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integer Linear Programming optimization
Balanced expert-to-GPU assignment
Minimized inter-GPU token routing (a toy comparison against round-robin placement is sketched below)
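As a usage illustration of these ideas, the snippet below compares a round-robin placement against the ILP-based one (reusing the `place_experts` sketch above) on the two quantities the paper targets: per-GPU token load and cross-GPU routing traffic. All numbers are synthetic and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
E, G = 8, 2
load = rng.integers(100, 1000, size=E).tolist()       # tokens routed to each expert (synthetic)
dep = rng.integers(0, 50, size=(E, E)).tolist()       # inter-layer co-routing counts (synthetic)

def placement_cost(assign):
    """Per-GPU token load and total cross-GPU dependency traffic for a placement."""
    per_gpu = [sum(load[e] for e in range(E) if assign[e] == g) for g in range(G)]
    cross = sum(dep[i][j] for i in range(E) for j in range(E) if assign[i] != assign[j])
    return per_gpu, cross

round_robin = {e: e % G for e in range(E)}
ilp_based = place_experts(load, dep, G)                # from the ILP sketch above
print("round-robin:", placement_cost(round_robin))
print("ILP-based:  ", placement_cost(ilp_based))
```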