🤖 AI Summary
This work addresses the inefficiencies in existing optical interconnect scheduling methods for Mixture-of-Experts (MoE) models, which optimize communication in isolation and neglect coordination with subsequent expert computation, leading to scheduling bubbles and suboptimal performance. To tackle the all-to-all communication scheduling challenge, the paper proposes a novel strategy that jointly considers communication-computation overlap and batch size, deliberately avoiding traditional Birkhoff-von Neumann (BvN) decomposition to prevent execution fragmentation. The approach employs a greedy maximum-weight matching algorithm with a constrained number of matchings for circuit scheduling, integrated with an MoE communication model tailored to optical interconnect architectures. This design significantly reduces scheduling bubbles and computational overhead, achieves near-ideal congestion-free communication performance, and substantially enhances overall MoE execution efficiency while maintaining large-batch processing capabilities.
📝 Abstract
The growing demand for efficient communication in distributed training and inference has sparked significant interest in reconfigurable photonic interconnects across both academia and industry. Mixture-of-Experts (MoE) models, with their highly skewed communication patterns, present a natural opportunity for such circuit-switched fabrics. However, existing approaches largely optimize communication in isolation, overlooking the interaction between communication and the expert computation that follows.
In this paper, we revisit circuit scheduling for all-to-all communication in MoE execution. We show that the dispatch-compute-combine structure fundamentally challenges classical scheduling techniques such as Birkhoff-von Neumann (BvN) decomposition. First, MoE communication matrices are rarely doubly stochastic, introducing significant scheduling bubbles in BvN-based schedules. Second, while decomposition enables communication-compute overlap, the excessive number of matchings produced by BvN fragments execution into small batches, leading to severe compute inefficiencies due to fixed execution overheads. Motivated by these observations, we explore a simple greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching. Despite its simplicity, the approach significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all.
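To make the idea of a bounded greedy max-weight decomposition concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a token-count demand matrix `demand[i][j]` (tokens sent from device `i` to device `j`), a cap `max_matchings` on the number of circuit configurations, and that each matched pair's demand is served in full within its round; the function names and the example matrix are hypothetical. The matching step is a simple heaviest-edge-first heuristic, which approximates rather than guarantees a maximum-weight matching.

```python
from typing import List, Tuple

Matching = List[Tuple[int, int, float]]  # (source, destination, tokens)


def greedy_matching(demand: List[List[float]]) -> Matching:
    """Pick the heaviest entries first; each source and destination is used at most once."""
    entries = sorted(
        ((w, i, j) for i, row in enumerate(demand) for j, w in enumerate(row) if w > 0),
        key=lambda e: (-e[0], e[1], e[2]),  # heaviest first, deterministic tie-break
    )
    used_src, used_dst = set(), set()
    matching: Matching = []
    for w, i, j in entries:
        if i not in used_src and j not in used_dst:
            matching.append((i, j, w))
            used_src.add(i)
            used_dst.add(j)
    return matching


def bounded_greedy_decomposition(
    demand: List[List[float]], max_matchings: int
) -> Tuple[List[Matching], List[List[float]]]:
    """Return at most max_matchings matchings plus whatever demand is left unserved.

    Assumption: a matched pair transfers its whole demand in one round, so each
    expert receives one large batch per matching instead of many small slices.
    """
    residual = [list(row) for row in demand]
    schedule: List[Matching] = []
    for _ in range(max_matchings):
        matching = greedy_matching(residual)
        if not matching:
            break  # all demand has been scheduled
        for i, j, _tokens in matching:
            residual[i][j] = 0
        schedule.append(matching)
    return schedule, residual


# Hypothetical skewed demand: column sums differ (8, 11, 8), so it is not doubly stochastic.
demand = [
    [0, 8, 1],
    [2, 0, 7],
    [6, 3, 0],
]
schedule, residual = bounded_greedy_decomposition(demand, max_matchings=2)
for round_index, matching in enumerate(schedule):
    print(round_index, matching)
```

In this sketch, lowering `max_matchings` trades unserved residual demand for fewer, larger batches; how the residual is handled (for example, by a fallback path) is left open here and is not specified by the abstract.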