MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
MoE model inference suffers from memory-intensive feed-forward network (FFN) modules, low GPU utilization, and high serving costs, because sparse activation leaves each expert with only a small per-step batch. The paper proposes MegaScale-Infer, a disaggregated inference architecture that places attention and FFN modules on separate devices, enabling independent scaling, tailored per-module parallelism, and ping-pong pipelined execution of micro-batches. Combined with heterogeneous, hardware-aware deployment and a zero-copy M2N communication library, the design tackles the two main obstacles to disaggregation: inter-module communication overhead and scheduling granularity. The result is fine-grained, module-level disaggregation with coordinated pipelining that keeps both attention and FFN GPUs busy. Experiments show up to 1.90× higher per-GPU throughput than state-of-the-art baselines, along with lower end-to-end latency and per-token inference cost.
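To make the utilization problem concrete, here is a back-of-the-envelope sketch (not from the paper; the model dimensions, batch size, and GPU peak numbers are illustrative assumptions) of how sparse routing shrinks the per-expert batch and pushes the FFN GEMM below a GPU's compute/bandwidth ridge point, i.e., into the memory-bound regime.

```python
# Back-of-the-envelope sketch (illustrative, not from the paper): why sparse expert
# routing turns the FFN GEMM memory-bound. Model dims, batch size, and GPU numbers
# (A100-like: ~312 TFLOPS fp16, ~2 TB/s HBM) are assumptions for illustration only.

def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte moved for an [m, k] x [k, n] GEMM in fp16."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

hidden, ffn = 4096, 14336          # illustrative hidden / FFN dimensions
ridge = 312e12 / 2.0e12            # ~156 FLOP/byte: above => compute-bound, below => memory-bound

batch_tokens = 256
cases = {
    "dense FFN (all 256 tokens hit it)": batch_tokens,
    "one MoE expert (top-2 of 64 experts)": batch_tokens * 2 // 64,  # ~8 tokens per expert
}
for name, tokens_per_ffn in cases.items():
    ai = gemm_arithmetic_intensity(tokens_per_ffn, hidden, ffn)
    regime = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{name}: {ai:6.1f} FLOP/byte -> {regime}")
```

The exact numbers depend on the model and hardware, but the qualitative shift from compute-bound (~237 FLOP/byte in this toy case) to memory-bound (~8 FLOP/byte per expert) is the utilization gap that disaggregation and ping-pong pipelining target.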

📝 Abstract
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
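As a rough illustration of the ping-pong pipeline parallelism described above, the following sketch (my reading of the abstract, not the paper's scheduler; the timing constants are invented) simulates two disaggregated pools, attention and FFN, and shows how splitting the batch into two micro-batches keeps the attention pool busy while the sibling micro-batch is being dispatched to and processed by the FFN pool.

```python
# Minimal timing sketch of ping-pong pipeline parallelism (my reading of the abstract,
# not the paper's scheduler). Attention and FFN run on two disaggregated GPU pools;
# the per-layer compute times and communication cost below are made-up constants.

def attention_pool_utilization(num_micro_batches, t_attn=1.0, t_comm=0.2, t_ffn=1.0, layers=4):
    """Simulate `layers` attention -> dispatch -> FFN -> combine rounds and return
    the fraction of elapsed time the attention pool spends computing."""
    attn_free = 0.0                              # when the attention pool is next idle
    ffn_free = 0.0                               # when the FFN pool is next idle
    mb_ready = [0.0] * num_micro_batches         # when each micro-batch can re-enter attention
    attn_busy = 0.0
    for _ in range(layers):
        for i in range(num_micro_batches):
            start_attn = max(attn_free, mb_ready[i])
            attn_free = start_attn + t_attn
            attn_busy += t_attn
            start_ffn = max(ffn_free, attn_free + t_comm)   # dispatch tokens to expert nodes
            ffn_free = start_ffn + t_ffn
            mb_ready[i] = ffn_free + t_comm                 # combine results back
    makespan = max([attn_free] + mb_ready)
    return attn_busy / makespan

for m in (1, 2):
    print(f"{m} micro-batch(es): attention-pool utilization ~ {attention_pool_utilization(m):.0%}")
# With one micro-batch the attention pool idles while the FFN pool works (~42% here);
# with two ping-ponging micro-batches the two pools overlap (~75% here).
```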
Problem

Research questions and friction points this paper is trying to address.

Improving GPU utilization in sparse MoE models
Reducing operational costs of large MoE inference
Optimizing communication for disaggregated attention-FFN modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregates attention and FFN modules so each can be scaled and parallelized independently
Uses ping-pong pipeline parallelism over micro-batches to hide communication and keep both module pools busy despite sparsity
Provides a zero-copy M2N communication library to cut token-dispatch overhead (sketched below)
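To ground the M2N item above: the communication pattern it serves is M attention nodes each sending grouped tokens directly to N FFN (expert) nodes. The sketch below is a hedged illustration of that dispatch planning in plain Python; the function name and the expert-to-node mapping are assumptions for illustration, not the library's actual API (which is a zero-copy, GPU-side implementation).

```python
# Hedged sketch of the M-to-N token-dispatch pattern (my illustration, not the
# paper's library): an attention node groups its tokens by destination expert so
# that each FFN node receives a single contiguous send per micro-batch.

from collections import defaultdict

def plan_dispatch(token_ids, expert_ids, expert_to_node):
    """Group tokens by the FFN node hosting their routed expert.

    token_ids:      tokens produced by this attention node for the current micro-batch
    expert_ids:     expert chosen for each token by the gating network (top-1 here)
    expert_to_node: which FFN node hosts each expert
    """
    sends = defaultdict(list)                      # destination FFN node -> token ids
    for tok, exp in zip(token_ids, expert_ids):
        sends[expert_to_node[exp]].append(tok)
    return dict(sends)                             # one point-to-point send per destination

# Illustrative numbers: 8 tokens, 4 experts spread over 2 FFN nodes.
plan = plan_dispatch(
    token_ids=list(range(8)),
    expert_ids=[0, 3, 1, 0, 2, 3, 1, 2],
    expert_to_node={0: "ffn-0", 1: "ffn-0", 2: "ffn-1", 3: "ffn-1"},
)
print(plan)   # e.g. {'ffn-0': [0, 2, 3, 6], 'ffn-1': [1, 4, 5, 7]}
```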
🔎 Similar Papers
No similar papers found.
👥 Authors
Ruidong Zhu · Peking University · Machine Learning Systems, Computer Systems, Distributed Systems
Ziheng Jiang · Research Scientist, ByteDance · Systems, Machine Learning
Chao Jin · ByteDance Seed, Peking University
Peng Wu · ByteDance Seed
Cesar A. Stuardo · ByteDance Seed
Dongyang Wang · ByteDance Seed
Xinlei Zhang · ByteDance Seed
Huaping Zhou · ByteDance Seed
Haoran Wei · ByteDance Seed
Yang Cheng · ByteDance Seed
Jianzhe Xiao · ByteDance Seed
Xinyi Zhang · ByteDance Seed
Lingjun Liu · ByteDance Seed
Haibin Lin · ByteDance · Machine Learning Systems, Natural Language Processing
Li-Wen Chang · Research Scientist, ByteDance · High Performance Computing, Compiler, Computer Architecture, Algorithms, Deep Learning
Jianxi Ye · ByteDance Seed
Xiao Yu · ByteDance Seed
Xuanzhe Liu · Boya Distinguished Professor, Peking University, ACM Distinguished Scientist · Machine Learning System, Mobile Computing System, Serverless Computing
Xin Jin · Peking University
Xin Liu · ByteDance Seed