Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key bottlenecks in existing expert parallelism (EP) approaches for training sparse mixture-of-experts (MoE) models: communication overhead that scales linearly with the number of activated experts, load imbalance, and non-deterministic communication patterns. To overcome these limitations, the authors propose the Multi-Head LatentMoE architecture together with a novel Head Parallel strategy, which achieves O(1) communication complexity for the first time, making communication costs independent of the number of activated experts while ensuring perfectly balanced and deterministic communication. Combined with I/O-aware routing and expert-computation optimizations, the method attains up to a 1.61× speedup over conventional EP without compromising model performance; even when expert granularity is doubled, it sustains a 1.11× speedup, significantly improving training efficiency and scalability.

📝 Abstract
Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.
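The abstract's central claim is the shift from communication cost that grows linearly in the number of activated experts $k$ (EP's all-to-all dispatch and combine) to cost that is constant in $k$ (HP). A minimal sketch of that scaling argument, using an illustrative cost model (the function names, the per-token latent exchange, and the factor of 2 for dispatch-plus-combine are assumptions for exposition, not the paper's actual implementation):

```python
# Toy communication-cost model contrasting Expert Parallel (EP) with the
# Head Parallel (HP) claim from the abstract. Illustrative only: the exact
# traffic of Multi-Head LatentMoE is not specified here.

def ep_comm_cost(tokens: int, hidden: int, k: int) -> int:
    """EP all-to-all: each token's activation is dispatched to its k
    activated experts and the outputs gathered back, so traffic grows
    linearly with k (the factor 2 models dispatch + combine)."""
    return 2 * tokens * hidden * k

def hp_comm_cost(tokens: int, latent: int) -> int:
    """HP, as claimed: a fixed-size latent per token is exchanged once,
    independent of how many experts are activated -> O(1) in k."""
    return 2 * tokens * latent

# Doubling k doubles EP traffic but leaves HP traffic unchanged,
# which is the regime where the paper reports its speedups.
assert ep_comm_cost(1024, 4096, 8) == 2 * ep_comm_cost(1024, 4096, 4)
assert hp_comm_cost(1024, 4096) == hp_comm_cost(1024, 4096)
```

Because HP's traffic is also data-independent (every token sends the same fixed-size payload), the balanced and deterministic communication properties follow from the same structure.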
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
Expert Parallel
communication cost
load imbalance
deterministic communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Head LatentMoE
Head Parallel
Mixture of Experts
Communication-Efficient Parallelism
Deterministic Routing
Chenwei Cui
Arizona State University
Machine Learning
Rockwell Jackson
School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA
Benjamin Joseph Herrera
School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA
Ana María Tárano
School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA
Hannah Kerner
School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA