Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

📅 2026-05-31

🤖 AI Summary
This work addresses the high communication overhead and routing load imbalance in distributed inference of multi-task, sparsely activated Mixture-of-Experts (MoE) models. Both problems stem from existing task-agnostic placement strategies, which aggregate routing traces globally and overlook task-specific expert co-activation patterns. The authors propose Task-Aware Coactivation Grouping (TACG), a framework that builds an expert co-activation graph from family-specific dispatch and co-activation traces and assigns experts to GPUs under capacity constraints. A companion mechanism, Generic Expert Shared Replication (GESR), replicates generic experts across a few secondary GPUs to handle online load shifts. By incorporating task-aware co-activation into MoE deployment optimization for the first time, the approach reduces communication cost by 31.39% on average over the baseline across three open-source models, preserves an average Jain's fairness index of 0.9975, and consistently outperforms strong baselines under severe inference distribution shifts.
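For readers unfamiliar with the balance metric quoted above, the following is a minimal sketch of Jain's fairness index, J = (Σx)² / (n · Σx²), applied to per-GPU load. The function name `jain_fairness` and the example loads are illustrative assumptions, not taken from the paper.

```python
def jain_fairness(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    loads = list(loads)
    if not loads:
        raise ValueError("loads must be non-empty")
    total = sum(loads)
    sum_sq = sum(x * x for x in loads)
    if sum_sq == 0:
        # Convention chosen for this sketch: an all-zero load counts as even.
        return 1.0
    return total * total / (len(loads) * sum_sq)


# Hypothetical per-GPU token counts, for illustration only.
balanced = jain_fairness([100, 100, 100, 100])  # every GPU carries the same load
skewed = jain_fairness([400, 0, 0, 0])          # one GPU carries everything
print(balanced, skewed)
```

A value of 1.0 means perfectly even load, and 1/n means a single GPU handles all traffic, so a reported 0.9975 indicates near-uniform load across GPUs.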
πŸ“ Abstract
Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose Task-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce Generic Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.
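The grouping idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the trace format, the "largest traffic share" preference rule, the `boost` factor, and the greedy assignment are all assumptions chosen to show how family-specific traces, a reweighted co-activation graph, and an exact per-GPU capacity could fit together.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coactivation_by_family(traces):
    """traces: {family: [set of expert ids activated for one token, ...]}."""
    pair_counts, dispatch = {}, {}
    for family, tokens in traces.items():
        pairs, hits = Counter(), Counter()
        for experts in tokens:
            hits.update(experts)                             # per-expert dispatch count
            pairs.update(combinations(sorted(experts), 2))   # co-activated pairs (a < b)
        pair_counts[family], dispatch[family] = pairs, hits
    return pair_counts, dispatch


def preferred_family(dispatch, expert):
    """Assumed rule: the family that sends the largest share of its traffic here."""
    def share(family):
        total = sum(dispatch[family].values())
        return dispatch[family][expert] / total if total else 0.0
    return max(sorted(dispatch), key=share)


def reweighted_graph(pair_counts, dispatch, experts, boost=4.0):
    """Scale an edge by `boost` when both endpoints prefer the family it came from."""
    pref = {e: preferred_family(dispatch, e) for e in experts}
    weights = defaultdict(float)
    for family, pairs in pair_counts.items():
        for (a, b), count in pairs.items():
            same_family = pref[a] == family and pref[b] == family
            weights[(a, b)] += count * (boost if same_family else 1.0)
    return weights


def place(experts, weights, num_gpus, capacity):
    """Greedy placement: strongest experts first, each to the open GPU it fits best."""
    if len(experts) > num_gpus * capacity:
        raise ValueError("not enough capacity for all experts")

    def w(a, b):
        return weights.get((a, b) if a < b else (b, a), 0.0)

    strength = {e: sum(w(e, o) for o in experts if o != e) for e in experts}
    groups = [[] for _ in range(num_gpus)]
    for e in sorted(experts, key=lambda x: (-strength[x], x)):
        open_gpus = [g for g in range(num_gpus) if len(groups[g]) < capacity]
        # Prefer highest affinity, then the emptier GPU, then the lower GPU index.
        best = max(open_gpus,
                   key=lambda g: (sum(w(e, o) for o in groups[g]), -len(groups[g]), -g))
        groups[best].append(e)
    return groups


# Hypothetical traces: two task families, four experts, top-2 routing.
traces = {
    "code": [{0, 1}, {0, 1}, {0, 1}, {0, 2}],
    "math": [{2, 3}, {2, 3}, {2, 3}, {1, 3}],
}
experts = [0, 1, 2, 3]
pair_counts, dispatch = coactivation_by_family(traces)
weights = reweighted_graph(pair_counts, dispatch, experts)
groups = place(experts, weights, num_gpus=2, capacity=2)
print(groups)
```

In this toy input, experts 0 and 1 are tightly coupled under the "code" family and experts 2 and 3 under "math", so the boosted intra-family edges keep each pair on one GPU. A real deployment would also need the replication step that GESR describes, which this sketch omits.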
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
multi-task inference
communication efficiency
expert placement
task-aware co-activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-Aware Grouping
Mixture-of-Experts
Communication-Efficient Inference
Expert Placement
Load Balancing