PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the training instability of Mixture-of-Experts (MoE) large language models under reinforcement learning. The instability stems from router drift: expert activation patterns differ between the rollout (inference) phase and the training phase, which produces large fluctuations in importance sampling weights. To mitigate this, the authors propose Predictive Routing Replay (PR2), which attaches a lightweight evolution predictor to each router to forecast short-horizon router evolution. During rollout, top-k expert selection uses the predicted routing distribution, so gradients can reach experts that are likely to become active after updates. During training, the model replays the predicted route, keeping expert activations consistent between rollout and training. According to the authors, theoretical analysis and experiments indicate that PR2 reduces routing-induced mismatch, improves the stability of PPO-style RL algorithms, and yields stronger performance across reasoning benchmarks.
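
To make the link between routing mismatch and unstable importance sampling weights concrete, the standard PPO-style token-level ratio can be written with the routing made explicit. The symbols $z_t^{\text{roll}}$ and $z_t^{\text{train}}$ (the sets of experts activated for token $t$ in each phase) are notation assumed here for illustration and are not taken from the paper:

$$
r_t(\theta) = \frac{\pi_{\theta}\left(a_t \mid s_t;\, z_t^{\text{train}}\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t;\, z_t^{\text{roll}}\right)}
$$

Under this reading, if $z_t^{\text{train}} \neq z_t^{\text{roll}}$ the ratio can deviate from 1 even when $\theta = \theta_{\text{old}}$, because the two probabilities come from different expert subsets. Replaying the rollout route during training sets $z_t^{\text{train}} = z_t^{\text{roll}}$ and removes that source of deviation.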

📝 Abstract

Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert activations can change drastically across model updates and differ between disaggregated rollout and training phases, causing large rollout-training mismatch and unstable importance sampling weights in PPO-style RL algorithms. Routing replay mitigates this issue by freezing the replay route within each reasoning trajectory, but it ignores how the router evolves under off-policy updates and thus causes router staleness. To address this limitation, we propose Predictive Routing Replay (PR2), which augments each router with a lightweight evolution predictor that learns to anticipate short-horizon router evolution. During the rollout phase, we use the predictive routing distribution to apply top-$k$ routing, enabling gradients to reach experts that are likely to become active after updates. During the training phase, we replay the resulting predicted route to retain consistency for stable importance estimation. Theoretical analysis and experiments support that PR2 reduces routing-induced mismatch, improves RL stability, and yields stronger performance across various reasoning benchmarks.
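
The two phases described in the abstract can be illustrated with a toy, single-token sketch in plain Python. This is not the authors' implementation: the abstract does not specify the predictor's form, so it is modeled here as a hypothetical additive shift on the router logits (`predicted_delta`), and the gate renormalization in `replay_gates` is likewise an assumption. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def predictive_route(router_logits, predicted_delta, k):
    """Rollout phase: pick top-k experts from the *predicted* routing distribution.

    `predicted_delta` stands in for the output of the evolution predictor
    (assumed here to be an additive logit shift; hypothetical).
    """
    predicted_logits = [l + d for l, d in zip(router_logits, predicted_delta)]
    return top_k(softmax(predicted_logits), k)


def replay_gates(router_logits, replayed_route):
    """Training phase: reuse the recorded route instead of re-running top-k.

    Gate weights come from the current router, renormalized over the
    replayed experts (assumed), so gradients still flow through the router.
    """
    probs = softmax(router_logits)
    mass = sum(probs[i] for i in replayed_route)
    return {i: probs[i] / mass for i in replayed_route}


# Toy example with 4 experts and k = 2.
rollout_logits = [2.0, 1.0, 0.5, 0.0]     # router at rollout time
predicted_delta = [-1.5, 0.0, 1.0, 0.0]   # hypothetical predictor output
train_logits = [0.4, 1.1, 1.6, 0.0]       # router after some updates

plain_route = top_k(softmax(rollout_logits), 2)
pr2_route = predictive_route(rollout_logits, predicted_delta, 2)
gates = replay_gates(train_logits, pr2_route)
```

In this toy setup, plain top-2 routing at rollout selects experts 0 and 1, while the predicted route selects experts 1 and 2, which is also what the updated router would choose on its own. Replaying `pr2_route` during training keeps the activated experts identical across the two phases, which is the consistency property the abstract relies on for stable importance estimation.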

Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts

Reinforcement Learning

Router Drift

Rollout-Training Mismatch

Importance Sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Predictive Routing Replay

Mixture of Experts

Router Drift