AI Summary
This work addresses the high communication overhead and routing load imbalance in distributed inference of multi-task, sparsely activated Mixture-of-Experts (MoE) models. Existing placement strategies derive a single plan from globally aggregated, task-agnostic routing traces and therefore overlook task-specific expert co-activation patterns. The authors propose Task-Aware Coactivation Grouping (TACG), a framework that builds an expert co-activation graph from per-task-family dispatch and co-activation traces and assigns experts to GPUs under capacity constraints. A companion mechanism, Generic Expert Shared Replication (GESR), handles online load shifts. By incorporating task-aware co-activation into MoE deployment optimization for the first time, the approach reduces communication cost by 31.39% on average over the baseline across three open-source MoE models, maintains an average Jain's fairness index of 0.9975, and consistently outperforms strong baselines under severe inference distribution shifts.
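For readers unfamiliar with the load-balance metric quoted in the summary, Jain's fairness index over per-GPU loads x_1, ..., x_n is (sum of x_i)^2 / (n * sum of x_i^2). It equals 1 when every GPU carries the same load and falls to 1/n when one GPU carries everything. The snippet below is a minimal sketch of that formula. The function name and the sample loads are illustrative assumptions, not values from the paper.

```python
def jain_fairness(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 for perfectly balanced loads and 1/n when a single
    device carries all of the load.
    """
    n = len(loads)
    sum_sq = sum(x * x for x in loads)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero load")
    total = sum(loads)
    return total * total / (n * sum_sq)


# Hypothetical per-GPU token counts, for illustration only.
balanced = jain_fairness([100, 100, 100, 100])
skewed = jain_fairness([400, 0, 0, 0])
```

An index close to 1, such as the reported 0.9975, indicates that per-GPU loads are nearly uniform.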
Abstract
Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose Task-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce Generic Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain's fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, where the framework consistently outperforms strong baselines.
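The grouping idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm. It assumes a hypothetical trace format of `(task_family, experts_activated_for_one_token)` pairs, takes an expert's preferred family to be the one that dispatches to it most often, multiplies an edge weight by an assumed `boost` factor when both endpoints prefer the same family, and uses a simple greedy fill in place of whatever capacity-constrained solver the paper uses. All function and parameter names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_weighted_graph(traces, boost=2.0):
    """Build a family-reweighted co-activation graph from routing traces.

    traces: iterable of (task_family, experts) pairs, one per token.
    Returns (weights, preferred): edge weights keyed by (a, b) with a < b,
    and each expert's most frequent task family.
    """
    dispatch = defaultdict(Counter)  # expert -> family -> dispatch count
    coact = Counter()                # (a, b) -> co-activation count
    for family, experts in traces:
        unique = sorted(set(experts))
        for e in unique:
            dispatch[e][family] += 1
        for a, b in combinations(unique, 2):
            coact[(a, b)] += 1
    preferred = {e: fams.most_common(1)[0][0] for e, fams in dispatch.items()}
    weights = {}
    for (a, b), count in coact.items():
        same_family = preferred[a] == preferred[b]
        # Assumed reweighting rule: favor edges inside one task family.
        weights[(a, b)] = count * (boost if same_family else 1.0)
    return weights, preferred


def greedy_placement(experts, weights, num_gpus, capacity):
    """Fill each GPU with exactly `capacity` experts, greedily by edge weight."""
    if len(experts) != num_gpus * capacity:
        raise ValueError("expert count must equal num_gpus * capacity")

    def w(a, b):
        return weights.get((min(a, b), max(a, b)), 0.0)

    unassigned = sorted(experts)
    placement = []
    for _ in range(num_gpus):
        # Seed with the expert most strongly tied to the remaining experts.
        seed = max(unassigned,
                   key=lambda e: sum(w(e, o) for o in unassigned if o != e))
        group = [seed]
        unassigned.remove(seed)
        while len(group) < capacity:
            # Add the expert with the largest total weight to the current group.
            best = max(unassigned, key=lambda e: sum(w(e, g) for g in group))
            group.append(best)
            unassigned.remove(best)
        placement.append(sorted(group))
    return placement


# Toy traces: experts 0 and 1 co-activate on "code", 2 and 3 on "math".
traces = [("code", [0, 1])] * 3 + [("math", [2, 3])] * 3 + [("code", [0, 2])]
weights, preferred = build_weighted_graph(traces)
placement = greedy_placement([0, 1, 2, 3], weights, num_gpus=2, capacity=2)
```

On these toy traces the sketch co-locates experts 0 and 1 on one GPU and experts 2 and 3 on the other, so the single cross-family activation of the pair (0, 2) is the only one that would need cross-GPU communication.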