ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the training difficulty in Mixture-of-Experts (MoE) models caused by the non-differentiability of top-k routing. The authors propose ProbMoE, a framework that models expert selection as a probability distribution over expert subsets under a cardinality constraint, so that routing becomes probabilistic inference over a discrete subset space. In the forward pass, exactly k experts are activated by sampling a k-expert subset; in the backward pass, gradients flow through each expert's exact marginal probability, which serves as a tractable surrogate for the true gradient. The framework extends to dynamic-k routing, where the number of active experts per token varies within a predefined range shared by training and inference. In experiments, the Exact-k variant performs strongly against competitive baselines with improved expert utilization and routing diversity, while the Dynamic-k variant achieves comparable performance with fewer activated experts.

📝 Abstract

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass and, in the backward pass, uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
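
To make "a distribution over cardinality-constrained expert subsets" concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a product-form distribution in which the probability of a subset $S$ is proportional to $\exp(\sum_{i \in S} \text{logit}_i)$, restricted to subsets whose size lies in a given range; the abstract does not specify the parameterization, so this choice and the function names `subset_distribution`, `marginals`, and `sample_subset` are hypothetical. Subsets are enumerated by brute force, which is only feasible for a handful of experts.

```python
import itertools
import math
import random


def subset_distribution(logits, k_min, k_max):
    """Enumerate every expert subset S with k_min <= |S| <= k_max.

    Assumption (not from the paper): p(S) is proportional to
    exp(sum of the logits of the experts in S).
    Returns (subsets, probabilities).
    """
    n = len(logits)
    subsets = [
        s
        for k in range(k_min, k_max + 1)
        for s in itertools.combinations(range(n), k)
    ]
    weights = [math.exp(sum(logits[i] for i in s)) for s in subsets]
    z = sum(weights)  # normalizing constant over all admissible subsets
    return subsets, [w / z for w in weights]


def marginals(logits, k_min, k_max):
    """Exact marginal probability that each expert is in the selected subset."""
    subsets, probs = subset_distribution(logits, k_min, k_max)
    pi = [0.0] * len(logits)
    for s, p in zip(subsets, probs):
        for i in s:
            pi[i] += p
    return pi


def sample_subset(logits, k_min, k_max, rng):
    """Draw one admissible subset (the forward-pass routing decision)."""
    subsets, probs = subset_distribution(logits, k_min, k_max)
    return rng.choices(subsets, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1.0, -0.5, 0.3, 2.0]  # hypothetical router scores for 4 experts
    print("exact-k sample:", sample_subset(logits, 2, 2, rng))
    print("exact-k marginals:", marginals(logits, 2, 2))
    print("dynamic-k sample:", sample_subset(logits, 1, 3, rng))
```

With `k_min == k_max == k`, the marginals sum to exactly $k$, since each sampled subset contains $k$ experts. Widening the range gives a dynamic-$k$ analogue in which the subset size itself is random. In an autodiff framework, the marginals would be the differentiable quantity through which router gradients flow; for the product-form assumption used here they can also be computed without enumeration via dynamic programming over elementary symmetric polynomials.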

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

Routing

Non-Differentiable

Gradient Estimation

Expert Selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic Routing

Mixture-of-Experts

Differentiable Routing