Expert Routing for Communication-Efficient MoE via Finite Expert Banks

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the information-theoretic efficiency of routing in sparse Mixture-of-Experts (MoE) architectures, where the gate must balance accuracy against communication and computation. The gate is modeled as a stochastic channel, with $I(X;T)$ measuring the routing information available to the selected expert. Because the selected model comes from a finite expert bank, the algorithmic mutual information $I(S;W)$ can be estimated in closed form from the empirical posterior $q(W|S)$. As the data-dependence parameter $\alpha$ is swept, this estimate follows the generalization gap monotonically, whereas the Xu–Raginsky bound is loose as expected; a uniform union-bound baseline is included for comparison. A separate empirical estimator of $I(X;T)$ is combined with the Blahut–Arimoto algorithm to trace an accuracy–rate curve over the expert bank. The framework offers a practical analytical tool for resource-aware MoE systems.
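The closed-form estimate of $I(S;W)$ over a finite bank can be illustrated with a plug-in computation of $H(W) - H(W|S)$. The sketch below is a minimal illustration, not the paper's implementation: it assumes the sampled training sets are weighted uniformly, and the names `plug_in_mi` and `softmax_selection` are hypothetical. In particular, the Gibbs-style selection rule controlled by `alpha` is only an assumed stand-in for the data-dependent rule, which the summary does not specify.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0 where p == 0
    return -np.sum(p * np.log(safe), axis=axis)


def plug_in_mi(q_w_given_s):
    """Plug-in estimate of I(S;W) = H(W) - H(W|S), in nats.

    q_w_given_s has shape (m, K): row i is the empirical posterior
    q(W | S_i) over K experts for the i-th sampled training set.
    Assumption: the m training sets are weighted uniformly.
    """
    q = np.asarray(q_w_given_s, dtype=float)
    q_w = q.mean(axis=0)                 # marginal over the expert bank
    h_w = entropy(q_w)                   # H(W)
    h_w_given_s = entropy(q, axis=1).mean()  # H(W|S)
    return float(h_w - h_w_given_s)


def softmax_selection(emp_risks, alpha):
    """Hypothetical selection rule: q(w | S) proportional to exp(-alpha * risk).

    emp_risks has shape (m, K): empirical risk of each expert on each S_i.
    alpha = 0 ignores the data; large alpha approaches argmin selection.
    """
    z = -alpha * np.asarray(emp_risks, dtype=float)
    z -= z.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

Under these assumptions, a data-independent rule (`alpha = 0`) gives an estimate of zero, and the estimate can never exceed $\log K$ for a bank of $K$ experts, which is what keeps the quantity tractable in the finite-bank setting.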

📝 Abstract

Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing interface controlling computation, communication, and accuracy. Motivated by finite-rate interpretations of MoE gating, we treat the gate as a stochastic channel and use $I(X;T)$ to quantify the routing information available to the selected expert. To make the associated information quantities tractable beyond synthetic examples, we develop a finite-bank MNIST construction using pretrained CNN experts and a discrete, data-dependent selection rule. Since the selected model belongs to a finite candidate set, the algorithmic mutual information $I(S;W)$ admits a closed-form discrete-entropy estimator from the empirical posterior $q(W|S)$. Sweeping a data-dependence parameter $\alpha$, we observe that $\widehat{I}(S;W)$ monotonically tracks the generalization gap, while the Xu-Raginsky bound exhibits the expected looseness. We also compare with a uniform union-bound baseline and introduce an empirical estimator of $I(X;T)$ together with a Blahut-Arimoto procedure for tracing an accuracy-rate curve over the expert bank. The proposed framework provides a practical tool for analyzing resource-aware MoE inference systems and for interpreting $I(X;T)$ and $D(R_g)$ as design proxies for efficient expert routing.
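The Blahut-Arimoto procedure mentioned in the abstract can be sketched with the standard rate-distortion iteration. This is a generic textbook version under stated assumptions, not the authors' code: the function name `blahut_arimoto`, the trade-off parameter `beta`, and the choice of a per-input loss matrix `dist[x, t]` (for example, 1 minus the accuracy of expert `t` on input `x`) as the distortion are all assumptions made for illustration.

```python
import numpy as np


def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """Standard Blahut-Arimoto iteration for one point on a rate-distortion curve.

    p_x  : shape (n,), distribution over inputs X.
    dist : shape (n, K), assumed distortion of routing input x to expert t.
    beta : trade-off parameter; larger beta favors lower distortion.
    Returns (rate in nats, expected distortion, gate q(t|x) of shape (n, K)).
    """
    p_x = np.asarray(p_x, dtype=float)
    dist = np.asarray(dist, dtype=float)
    k = dist.shape[1]
    r = np.full(k, 1.0 / k)                    # marginal over experts
    for _ in range(n_iter):
        # Gate update: q(t|x) proportional to r(t) * exp(-beta * d(x, t)).
        logits = np.log(r)[None, :] - beta * dist
        logits -= logits.max(axis=1, keepdims=True)
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
        # Marginal update, floored so that log(r) stays finite.
        r = np.maximum(p_x @ q, 1e-300)
    # Rate I(X;T) for the joint p(x) q(t|x); zero-probability terms contribute 0.
    ratio = np.where(q > 0, q / r[None, :], 1.0)
    rate = float(np.sum(p_x[:, None] * q * np.log(ratio)))
    distortion = float(np.sum(p_x[:, None] * q * dist))
    return rate, distortion, q
```

Sweeping `beta` from 0 upward yields a sequence of (rate, distortion) pairs. With `beta = 0` the gate ignores the input and the rate is zero. As `beta` grows, the gate approaches a deterministic assignment and the rate approaches the entropy of that assignment.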

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

expert routing

communication efficiency

information bottleneck

finite expert bank

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

expert routing

mutual information