FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In distributed Mixture-of-Experts (MoE) training, computation and communication scheduling are traditionally decoupled, neglecting holistic optimization across critical operations—including multi-head attention (MHA), gating, and all-reduce. This paper introduces FlowMoE, the first framework to unify scheduling of heterogeneous tasks: MHA, gating, expert computation, and all-to-all/all-reduce communication. FlowMoE employs a tensor chunk-based priority scheduler that tightly integrates pipeline parallelism with computation-communication overlap. Implemented as an adaptive, general-purpose library atop PyTorch, it supports diverse MoE architectures and hardware configurations. Extensive experiments demonstrate consistent improvements: 13–57% reduction in training time, 10–39% lower energy consumption, and 7–32% decreased memory footprint. These gains significantly enhance both training efficiency and resource utilization in distributed MoE systems.

📝 Abstract
The parameter size of modern large language models (LLMs) can be scaled up via the sparsely-activated Mixture-of-Experts (MoE) technique to avoid an excessive increase in computational cost. To further improve training efficiency, pipelining computation and communication has become a promising solution for distributed MoE training. However, existing work primarily focuses on scheduling tasks within the MoE layer, such as expert computing and all-to-all (A2A) communication, while neglecting other key operations including multi-head attention (MHA) computing, gating, and all-reduce communication. In this paper, we propose FlowMoE, a scalable framework for scheduling multi-type task pipelines. First, FlowMoE constructs a unified pipeline to consistently schedule MHA computing, gating, expert computing, and A2A communication. Second, FlowMoE introduces a tensor chunk-based priority scheduling mechanism to overlap the all-reduce communication with all computing tasks. We implement FlowMoE as an adaptive and generic framework atop PyTorch. Extensive experiments with 675 typical MoE layers and four real-world MoE models across two GPU clusters demonstrate that our proposed FlowMoE framework outperforms state-of-the-art MoE training frameworks, reducing training time by 13%-57%, energy consumption by 10%-39%, and memory usage by 7%-32%.
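The benefit of a unified pipeline can be illustrated with a small timing simulation. The sketch below is not FlowMoE's implementation (its actual scheduler runs on GPU streams atop PyTorch); it only models the idea from the abstract: split a batch into chunks that each pass through the stage sequence MHA → gating → A2A → expert → A2A, with computation and communication on two independent streams so one chunk's A2A overlaps another chunk's computing. Stage names and costs are illustrative assumptions.

```python
# Minimal sketch (illustrative, not FlowMoE's API): earliest-finish-time
# simulation of a unified MoE pipeline with one compute stream and one
# communication stream.
COMPUTE = {"mha", "gating", "expert"}          # stages on the compute stream
STAGES = ["mha", "gating", "a2a", "expert", "a2a"]  # per-chunk stage order

def pipeline_makespan(num_chunks, cost):
    """Return the total time to push `num_chunks` chunks through STAGES.

    `cost` maps a stage name to its per-chunk duration. A chunk's stage may
    start only after the chunk finished its previous stage AND the stream
    that runs this stage is free, so A2A of one chunk can overlap compute
    of another.
    """
    compute_free = comm_free = 0.0      # when each stream becomes available
    chunk_ready = [0.0] * num_chunks    # when each chunk finished its last stage
    for stage in STAGES:
        for c in range(num_chunks):
            if stage in COMPUTE:
                start = max(compute_free, chunk_ready[c])
                compute_free = start + cost[stage]
                chunk_ready[c] = compute_free
            else:                        # A2A on the communication stream
                start = max(comm_free, chunk_ready[c])
                comm_free = start + cost[stage]
                chunk_ready[c] = comm_free
    return max(chunk_ready)
```

With one chunk (no pipelining) the stages serialize, e.g. costs 2+1+3+4+3 = 13 time units; splitting the same work into four chunks lets A2A overlap expert/MHA computing and shortens the makespan, which is the effect the unified pipeline exploits.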
Problem

Research questions and friction points this paper is trying to address.

Scheduling multi-type task pipelines for distributed MoE training
Overlapping all-reduce communication with all computing tasks
Improving efficiency of sparse large language model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pipeline scheduling for multi-type tasks
Tensor chunk-based priority scheduling mechanism
Overlapping all-reduce communication with computing tasks
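The tensor chunk-based priority mechanism can be sketched as follows. This is a hypothetical illustration, not FlowMoE's interface: gradient tensors are split into chunks, each chunk's all-reduce is released as soon as its gradient is ready during backward, and a priority queue lets the single communication stream always launch the most urgent ready chunk (e.g. parameters needed earliest in the next forward pass), so communication overlaps the remaining computing instead of blocking it. All names, priorities, and costs below are assumptions for the sketch.

```python
import heapq

def schedule_allreduce(ready_events, comm_cost):
    """Simulate priority-driven all-reduce of gradient chunks.

    ready_events: list of (ready_time, priority, chunk_id) tuples; a chunk's
        all-reduce may start once its gradient is ready. Lower priority value
        means more urgent (needed earlier next iteration).
    comm_cost: chunk_id -> all-reduce duration on the communication stream.
    Returns {chunk_id: finish_time} for a single communication stream that
    always picks the most urgent chunk among those already ready.
    """
    events = sorted(ready_events)            # order by gradient-ready time
    heap, finish, now, i = [], {}, 0.0, 0
    while i < len(events) or heap:
        # admit every chunk whose gradient is ready by `now`
        while i < len(events) and events[i][0] <= now:
            _, prio, cid = events[i]
            heapq.heappush(heap, (prio, cid))
            i += 1
        if not heap:                         # stream idle: jump to next ready chunk
            now = events[i][0]
            continue
        prio, cid = heapq.heappop(heap)      # most urgent ready chunk first
        now += comm_cost[cid]                # run its all-reduce
        finish[cid] = now
    return finish
```

For example, if two chunks become ready at the same time, the scheduler runs the higher-priority (lower value) chunk's all-reduce first, deferring less urgent chunks so they fill gaps while computing tasks proceed.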