Route Experts by Sequence, not by Token

📅 2025-11-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Top-K routing in Mixture-of-Experts (MoE) models ignores token-level complexity heterogeneity, while adaptive alternatives typically require retraining or introduce auxiliary parameters. To address this, we propose SeqTopK: a parameter-free, post-hoc dynamic routing mechanism operating at the sequence level, rather than the token level, without architectural modification or fine-tuning. SeqTopK allocates expert capacity uniformly across the entire input sequence, enabling high-complexity tokens to activate more experts and low-complexity tokens to engage fewer, while strictly preserving the total computational budget. It is fully plug-and-play for any pretrained MoE model, requiring only minimal code changes. Experiments demonstrate that SeqTopK consistently outperforms standard Top-K and all existing parameter-free adaptive baselines across diverse domains, including mathematical reasoning, programming, legal analysis, and creative writing, with up to 16.9% improvement under highly sparse configurations.

📝 Abstract
Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation, assigning more experts to difficult tokens and fewer to easy ones, while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math, coding, law, and writing show consistent improvements over TopK and prior parameter-free adaptive methods, with gains that become substantially larger under higher sparsity (up to 16.9%). These results highlight SeqTopK as a simple, efficient, and scalable routing strategy, particularly well-suited for the extreme sparsity regimes of next-generation LLMs. Code is available at https://github.com/Y-Research-SBU/SeqTopK.
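To make the budget shift concrete, here is a minimal NumPy sketch contrasting standard token-level TopK with the sequence-level selection the abstract describes. The function names and the use of raw router logits (rather than softmax gates) are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
import numpy as np

def token_topk(logits, k):
    """Standard TopK: each of the T tokens activates exactly k experts."""
    T, E = logits.shape
    mask = np.zeros((T, E), dtype=bool)
    # Indices of the k largest router scores per token (row-wise).
    idx = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    mask[np.arange(T)[:, None], idx] = True
    return mask

def seq_topk(logits, k):
    """SeqTopK: pick the T*k largest scores across the whole sequence.

    Per-token expert counts can now vary (hard tokens get more,
    easy tokens fewer) while the total budget T*k is unchanged.
    """
    T, E = logits.shape
    budget = T * k
    flat = logits.ravel()
    top = np.argpartition(-flat, budget - 1)[:budget]
    mask = np.zeros(T * E, dtype=bool)
    mask[top] = True
    return mask.reshape(T, E)

# Toy example: 3 tokens, 4 experts, k = 2.
logits = np.array([[3., 1., 0., 2.],
                   [0., -1., -2., -3.],
                   [5., 4., 3., 2.]])
print(token_topk(logits, 2).sum(axis=1))  # every token gets exactly 2 experts
print(seq_topk(logits, 2).sum(axis=1))    # counts vary, total still 3*2 = 6
```

Both routings activate exactly $T \cdot K$ experts in total; the only change is the scope of the top-k, which is why the method drops into a pretrained MoE with a few lines of code.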
Problem

Research questions and friction points this paper is trying to address.

Standard TopK routing assigns fixed experts to all tokens, ignoring complexity variation.
Prior adaptive methods require costly retraining with additional modules and hyperparameters.
How can expert allocation adapt to token difficulty while preserving the compute budget and compatibility with pretrained models?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifts expert budget from token to sequence level
Enables end-to-end learned dynamic expert allocation
Maintains full compatibility with pretrained MoE models