🤖 AI Summary
Existing MoE-based large language models suffer from inefficient expert prediction—particularly the inability to prefetch experts at the first layer, low prediction accuracy, and high computational overhead. This paper proposes a pre-attention expert prediction mechanism: leveraging lightweight activation features computed prior to the attention layer within the same transformer block, it constructs an efficient routing module via bilinear transformation and a ranking-aware loss function. Crucially, we identify for the first time a strong rank-preservation property in expert selection across LLMs, enabling highly accurate modeling of expert rankings using simple linear models—and thereby enabling prefetching even at the initial layer. Evaluated on DeepSeek-V2-Lite, Qwen3-30B, and Phi-mini-MoE, our method achieves expert prediction accuracies of 93.03%, 94.69%, and 97.62%, respectively—improving over state-of-the-art by approximately 15 percentage points—and significantly reduces inference latency and computational cost.
📝 Abstract
Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up model capacity while keeping inference cost relatively low. Since MoE models activate only a subset of experts per token, prior work has proposed expert prediction and caching methods to prefetch experts for faster inference. However, existing approaches use the activations from the previous layer for prediction, which yields low accuracy and leaves the first layer unoptimized, while applying complex layers or even training standalone networks for better prediction introduces high computational overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are rank-preserving, so the ranking of selected experts can be matched with simple linear functions. We therefore use the activations before the attention block of the same layer, together with two linear functions and a ranking-aware loss, to achieve accurate prediction, which also enables prefetching at the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, an improvement of about 15 percentage points in absolute accuracy over state-of-the-art methods.
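The mechanism described above — scoring experts from pre-attention activations with two linear maps (a bilinear/low-rank projection) and training with a ranking-aware loss — can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: all names, dimensions, and the specific pairwise hinge loss are assumptions chosen for clarity.

```python
# Hypothetical sketch of a pre-attention expert router (NOT the paper's
# implementation): two linear maps score experts from the pre-attention
# hidden state, so expert prefetching can start before attention runs,
# even at the first layer. Dimensions and loss form are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_low, n_experts, top_k = 64, 16, 8, 2

# Two linear functions: a low-rank (bilinear) projection from the
# pre-attention activation to per-expert scores.
W_down = rng.standard_normal((d_model, d_low)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_low, n_experts)) / np.sqrt(d_low)

def predict_experts(h_pre_attn):
    """Score all experts from the pre-attention activation of the SAME
    layer and return the predicted top-k experts plus raw scores."""
    scores = h_pre_attn @ W_down @ W_up            # shape: (n_experts,)
    topk = np.argsort(scores)[::-1][:top_k]        # highest-scoring experts
    return topk, scores

def ranking_loss(pred_scores, true_topk, margin=1.0):
    """Pairwise hinge ranking loss: each truly selected expert should
    outscore each non-selected expert by at least `margin`. Only the
    relative ranking matters, matching the rank-preservation insight."""
    negatives = [e for e in range(n_experts) if e not in true_topk]
    loss = 0.0
    for pos in true_topk:
        for neg in negatives:
            loss += max(0.0, margin - (pred_scores[pos] - pred_scores[neg]))
    return loss / (len(true_topk) * len(negatives))

h = rng.standard_normal(d_model)                   # a pre-attention activation
topk, scores = predict_experts(h)
print("predicted experts:", topk, "loss:", ranking_loss(scores, topk))
```

Because only relative order matters for top-k selection, a ranking loss like this trains the router to reproduce the gate's expert *ranking* rather than its exact scores, which is what makes such simple linear predictors viable.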