ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the straggler problem in distributed Mixture-of-Experts (MoE) inference, where input-dependent token routing interacts with GPU performance heterogeneity under synchronous execution. The authors propose ViBE, a hardware-aware expert placement framework that, according to the authors, is the first to jointly model device capabilities and expert activation loads. By assigning high-load experts to faster GPUs and low-load experts to slower ones, ViBE balances execution latency rather than token counts alone. The framework combines fine-grained GPU performance modeling, expert activation profiling, a hardware-aware bin-packing strategy, and a lightweight runtime recalibration mechanism. ViBE reduces execution-time imbalance, improves SLO attainment by 14%, and lowers P90 time-to-first-token (TTFT) by up to 45%, with the benefit growing as system scale increases.
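The placement idea in the summary can be illustrated with a minimal sketch. This is not ViBE's actual algorithm, which the summary does not specify. It is a generic greedy heuristic that places the heaviest experts first and picks the GPU whose estimated finish time stays lowest. The function name `place_experts`, the scalar per-GPU `gpu_speed` model, and the `slots_per_gpu` capacity limit are all assumptions made for illustration.

```python
from typing import Dict, List


def place_experts(expert_load: Dict[int, float],
                  gpu_speed: Dict[int, float],
                  slots_per_gpu: int) -> Dict[int, List[int]]:
    """Greedy latency-aware expert placement (illustrative, hypothetical).

    expert_load: expected tokens per expert (from activation profiling).
    gpu_speed:   relative throughput per GPU (assumed scalar model).
    Returns a mapping from GPU id to the list of expert ids placed on it.
    """
    if len(expert_load) > slots_per_gpu * len(gpu_speed):
        raise ValueError("not enough expert slots for all experts")

    placement: Dict[int, List[int]] = {g: [] for g in gpu_speed}
    assigned: Dict[int, float] = {g: 0.0 for g in gpu_speed}

    # Heaviest experts first, so large loads are settled while choices remain.
    for e in sorted(expert_load, key=lambda x: expert_load[x], reverse=True):
        # Only GPUs that still have a free expert slot are candidates.
        candidates = [g for g in gpu_speed if len(placement[g]) < slots_per_gpu]
        # Pick the GPU with the smallest estimated finish time after adding e.
        best = min(candidates,
                   key=lambda g: (assigned[g] + expert_load[e]) / gpu_speed[g])
        placement[best].append(e)
        assigned[best] += expert_load[e]
    return placement
```

With hypothetical loads `{0: 400, 1: 300, 2: 200, 3: 100}`, speeds `{0: 1.0, 1: 0.8}`, and two slots per GPU, this sketch puts experts 0 and 2 on the faster GPU and experts 1 and 3 on the slower one. The token counts are deliberately unequal (600 versus 400) so that the estimated execution times end up closer together.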

📝 Abstract

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. Thus, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.
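The claim that balanced token assignments can still leave hardware-induced stragglers follows from the synchronous execution model, where layer latency is the maximum per-GPU time. The short sketch below makes the arithmetic concrete. The throughput numbers are invented for illustration, and the linear model (time = tokens / throughput) is a simplifying assumption. The abstract notes that real GPU throughput also depends on workload intensity.

```python
from typing import List, Sequence


def per_gpu_times(tokens: Sequence[float],
                  throughput: Sequence[float]) -> List[float]:
    """Estimated execution time per GPU under an assumed linear model."""
    return [t / s for t, s in zip(tokens, throughput)]


def sync_layer_time(tokens: Sequence[float],
                    throughput: Sequence[float]) -> float:
    """Under synchronized execution the slowest GPU sets the layer latency."""
    return max(per_gpu_times(tokens, throughput))


# Hypothetical relative throughputs: three nominal GPUs and one 20% slower.
throughput = [1.0, 1.0, 1.0, 0.8]

# Token-balanced assignment: every GPU gets the same number of tokens.
token_balanced = [250, 250, 250, 250]

# Time-balanced assignment: same 1000 tokens, fewer on the slow GPU.
time_balanced = [263, 263, 263, 211]

t_token = sync_layer_time(token_balanced, throughput)  # slow GPU is the straggler
t_time = sync_layer_time(time_balanced, throughput)    # per-GPU times nearly equal
```

In this toy setting the token-balanced split is gated by the slow GPU, while the time-balanced split finishes sooner for the same total work. That gap is what hardware-aware placement targets.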

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

workload skew

hardware variability

straggler mitigation

distributed inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

hardware variability

straggler mitigation