Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory-bound latency of autoregressive decoding in Mixture-of-Experts (MoE) large language models, where decode latency is governed by the number of experts activated per batch, this paper proposes a training-free, dynamic token-to-expert rerouting mechanism. The core innovation is a batch-aware routing strategy: leveraging real-time per-expert load information within the current batch, it opportunistically reroutes incoming tokens to experts that are already activated but underutilized, improving expert reuse and reducing memory-access overhead. The method significantly decreases the number of activated experts per decoding step without any model retraining. Evaluated on Qwen3-30B and Qwen3-235B, it achieves 39% and 15% reductions in MoE-layer decode latency, respectively, while preserving generation quality with no statistically significant degradation.

📝 Abstract
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, where the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use batch-aware routing, in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach achieves reductions of $39\%$ and $15\%$ in MoE-layer decode latency, respectively.
Problem

Research questions and friction points this paper is trying to address.

Reducing MoE model latency by optimizing expert activation during decoding
Dynamically rerouting tokens to minimize activated experts per batch
Maintaining model quality while significantly improving inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token-to-expert routing reduces activated experts
Batch-aware routing piggybacks pre-loaded experts in memory
Method preserves model quality while cutting decode latency
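As a rough illustration of the batch-aware idea described above (a hypothetical sketch, not the paper's implementation; the function name, `window` parameter, and greedy scheme are invented for illustration), each token keeps its top-scoring expert, and the remaining routing slots prefer experts already activated by other tokens in the same batch:

```python
import numpy as np

def opportunistic_route(logits: np.ndarray, k: int, window: int) -> list[list[int]]:
    """Illustrative sketch of batch-aware expert routing (not the paper's code).

    Each token keeps its top-1 expert; its remaining k-1 slots prefer experts
    already activated within the batch, drawn from the token's top-`window`
    candidates by router score.
    """
    batch, n_experts = logits.shape
    order = np.argsort(-logits, axis=1)           # experts by descending score
    active = {int(e) for e in order[:, 0]}        # top-1 experts are mandatory
    routes = []
    for t in range(batch):
        chosen = [int(order[t, 0])]
        candidates = [int(e) for e in order[t, 1:window]]
        # Prefer experts already resident in memory ("piggybacking"),
        # then fall back to the token's own next-best experts.
        resident = [e for e in candidates if e in active]
        fresh = [e for e in candidates if e not in active]
        for e in resident + fresh:
            if len(chosen) == k:
                break
            chosen.append(e)
            active.add(e)
        routes.append(chosen)
    return routes
```

Under this greedy scheme, tokens whose second-choice experts overlap with the batch's already-active set reuse those experts, shrinking the total number of expert weight loads per decode step relative to plain top-k routing.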
Costin-Andrei Oncescu
Harvard University. Part of the work was done when Costin was interning at Together AI.
Qingyang Wu
Together AI
Wai Tong Chung
Together AI
Robert Wu
Together AI
Bryan Gopal
Together AI
Junxiong Wang
Cornell University
Tri Dao
Princeton University, Together AI
Ben Athiwaratkun
Together AI