Redesign Mixture-of-Experts Routers with Manifold Power Iteration

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of effective design principles for routers in existing Mixture-of-Experts (MoE) models, which struggle to accurately capture the affinity between tokens and experts. The authors propose using the dominant (principal) singular direction of each expert matrix as the target for the corresponding router row, and introduce a "Power-then-Retract" paradigm: a power iteration step followed by a retraction that imposes a norm constraint. During pretraining, this manifold-based update dynamically aligns the router's row vectors with these dominant singular directions. The approach is designed to balance alignment accuracy, computational efficiency, and training stability. Experiments on MoE models ranging from 1B to 11B parameters are reported to confirm that the alignment yields more effective models.
📝 Abstract
The router is a cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix are scored by their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert matrix into a representative vector, so that its dot product with a token better reflects token-expert affinity. However, no design principle exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment yields more effective MoE models.
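The "Power-then-Retract" idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation and rests on several assumptions not stated above: each expert is reduced to a single weight matrix W_e of shape (h, d) applied to a token as W_e x, the "principal singular direction" is taken to be the top right singular vector of W_e, the retraction is plain renormalization onto the unit sphere, and the update runs on its own rather than interleaved with gradient steps during pretraining. The names `power_then_retract` and `top_k_route` are hypothetical.

```python
import numpy as np


def power_then_retract(router, experts, steps=1, eps=1e-12):
    """Illustrative Power-then-Retract update (an assumption, not the paper's code).

    router:  (E, d) array, one row per expert.
    experts: (E, h, d) array, one assumed weight matrix W_e per expert.
    """
    for _ in range(steps):
        # Power step: r_e <- W_e^T (W_e r_e), computed for all experts at once.
        hidden = np.einsum("ehd,ed->eh", experts, router)
        router = np.einsum("ehd,eh->ed", experts, hidden)
        # Retraction: project each row back onto the unit sphere (norm constraint).
        norms = np.linalg.norm(router, axis=1, keepdims=True)
        router = router / np.maximum(norms, eps)
    return router


def top_k_route(tokens, router, k):
    """Score tokens against router rows by dot product and pick the top-k experts."""
    scores = tokens @ router.T                  # (T, E) token-expert affinities
    return np.argsort(-scores, axis=1)[:, :k]   # indices of the k highest scores


# Small demonstration on random data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
E, h, d = 4, 16, 8
experts = rng.normal(size=(E, h, d))
router = rng.normal(size=(E, d))
router /= np.linalg.norm(router, axis=1, keepdims=True)

aligned = power_then_retract(router, experts, steps=500)

# Compare against the top right singular vector of each expert matrix.
top_dirs = np.stack([np.linalg.svd(W)[2][0] for W in experts])
cosines = np.abs(np.sum(aligned * top_dirs, axis=1))
print("abs cosine to principal direction per expert:", cosines)

tokens = rng.normal(size=(3, d))
print("top-2 experts per token:", top_k_route(tokens, aligned, k=2))
```

Under these assumptions, repeated application is ordinary power iteration on W_e^T W_e, so each row approaches the top right singular vector at a rate governed by the gap between the two largest singular values. The renormalization keeps the row norms fixed at 1, which is one simple way to read the norm constraint mentioned in the abstract.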
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
router design
expert representation
singular direction
token-expert affinity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Router Design
Manifold Power Iteration
Singular Direction Alignment
Power-then-Retract