STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional Mixture-of-Experts (MoE) routing: the router ignores input structure, which leads to unstable expert assignment and insufficient specialization. The authors recast routing as a dynamic subspace learning problem, which they present as the first structure-aware approach to MoE routing. Specifically, they use the Generalized Hebbian Algorithm (GHA) to track the principal subspace of incoming inputs online and optimize this representation jointly with a learnable router. This yields stable expert selection that can optionally be updated at test time. Evaluated on synthetic data and on large-scale language and vision tasks, the method consistently improves routing quality and downstream performance over strong MoE baselines, and test-time subspace updates improve generalization under distribution shift.
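To make the subspace-tracking idea concrete, here is a minimal NumPy sketch of the textbook Generalized Hebbian Algorithm (Sanger's rule) on synthetic data. It is not the authors' implementation: the function names `gha_step` and `track_subspace`, the learning rate, and the toy data are all assumptions chosen for illustration, and the coupling to a learnable router is not shown.

```python
import numpy as np

def gha_step(W, x, lr):
    """One Sanger's-rule update (hypothetical helper).

    W has shape (k, d); its rows approximate the top-k principal directions.
    x is a single zero-mean input of shape (d,).
    """
    y = W @ x  # projections onto the current directions, shape (k,)
    # Hebbian term minus a lower-triangular decorrelation term.
    return W + lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

def track_subspace(X, k, lr=0.005, epochs=3, seed=0):
    """Stream the rows of X through GHA and return the learned (k, d) basis."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(k, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            W = gha_step(W, x, lr)
    return W

# Toy data (assumption): zero-mean Gaussian with most variance on axis 0,
# then axis 1, so the true principal directions are the coordinate axes.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(5000, 4)) * np.array([3.0, 1.0, 0.3, 0.1])

W = track_subspace(X, k=2)
# Cosine of the angle between each learned row and its expected axis.
align0 = abs(W[0, 0]) / np.linalg.norm(W[0])
align1 = abs(W[1, 1]) / np.linalg.norm(W[1])
print(align0, align1)
```

Because each update uses only the current sample, the same `gha_step` could in principle keep running on unlabeled inputs at inference time, which is the property that test-time subspace updates rely on.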

📝 Abstract

Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable routing. We propose STAR (Structure-Aware Routing), which rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on a controlled synthetic setup and on large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.
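For reference, the "shallow linear projection" baseline that the abstract contrasts against can be sketched as a generic top-k softmax router. This is a common formulation rather than the paper's code; the name `topk_route`, the weight matrix `W_r`, and the sizes below are illustrative assumptions.

```python
import numpy as np

def topk_route(X, W_r, k):
    """Generic linear top-k router (hypothetical helper).

    X: (n, d) token representations; W_r: (d, E) router weights.
    Returns expert indices (n, k) and gate weights (n, k) that sum to 1.
    """
    logits = X @ W_r  # one score per expert, shape (n, E)
    idx = np.argsort(-logits, axis=1)[:, :k]  # k highest-scoring experts
    sel = np.take_along_axis(logits, idx, axis=1)
    sel = sel - sel.max(axis=1, keepdims=True)  # numerical stability
    gates = np.exp(sel)
    gates /= gates.sum(axis=1, keepdims=True)  # softmax over selected experts
    return idx, gates

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))       # 6 tokens, 8 features (toy sizes)
W_r = rng.normal(size=(8, 4))     # 4 experts
idx, gates = topk_route(X, W_r, k=2)
print(idx)
print(gates)
```

Here the expert choice depends only on a single learned projection of each token, which is the limited structure awareness the abstract describes.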

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

routing

input structure

expert specialization

subspace learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Structure-Aware Routing

Subspace Learning