ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts

📅 2025-10-20
🤖 AI Summary
Traditional Mixture-of-Experts (MoE) models are constrained by layer-local routing, forcing a trade-off between expert dimensionality and routing diversity. To address this, we propose ReXMoE, which breaks inter-layer isolation by letting routers reuse experts across adjacent layers, decoupling expert dimensionality from per-layer parameter budgets, together with a progressive scaling routing (PSR) strategy that gradually enlarges each router's candidate expert pool during training. This design enhances model expressivity and routing flexibility without increasing the total parameter count. Experiments on MoE models ranging from 0.5B to 7B parameters demonstrate consistent improvements in both language modeling and downstream task performance, establishing ReXMoE as an efficient and scalable paradigm that overcomes fundamental limitations of conventional layer-wise MoE designs.

📝 Abstract
Mixture-of-Experts (MoE) architectures have emerged as a promising approach to scale Large Language Models (LLMs). MoE improves efficiency by activating only a subset of experts per token. Recent works show that fine-grained experts substantially enrich the combinatorial flexibility of active experts and enhance model expressiveness. However, such a design is fundamentally limited by the layer-local routing mechanism: each layer is restricted to its own expert pool. This forces a careful trade-off between expert dimensionality and routing diversity under a fixed parameter budget. We describe ReXMoE, a novel MoE architecture that improves routing beyond existing layer-local approaches by allowing routers to reuse experts across adjacent layers. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity or inflating the overall parameter count. To this end, we propose a progressive scaling routing (PSR) strategy that gradually increases the candidate expert pool during training. As a result, ReXMoE improves both language modeling and downstream task performance. Extensive experiments on models ranging from 0.5B to 7B parameters across different architectures demonstrate that ReXMoE consistently improves performance under fixed architectural dimensions, confirming ReXMoE as a new design paradigm for parameter-efficient and scalable MoE-based LLMs.
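The cross-layer reuse idea in the abstract can be sketched in a few lines: a single shared expert pool, where each layer's router scores not only its own experts but also those of the adjacent layer. This is a hypothetical illustration (expert functions, pool layout, and window size are invented for clarity, not taken from the paper):

```python
import random

# Hypothetical sketch of cross-layer expert reuse (not the paper's code).
# A single global pool of experts is shared; each layer's router selects
# top-k from its own experts plus those of the adjacent (next) layer.

NUM_LAYERS = 4
EXPERTS_PER_LAYER = 4
TOP_K = 2

# One shared pool; layer l "owns" experts [l*E, (l+1)*E).
pool = [lambda x, i=i: x * (1 + 0.1 * i)
        for i in range(NUM_LAYERS * EXPERTS_PER_LAYER)]

def candidate_ids(layer):
    """Experts visible to this layer: its own plus the next layer's (reuse)."""
    start = layer * EXPERTS_PER_LAYER
    end = min((layer + 2) * EXPERTS_PER_LAYER, len(pool))
    return list(range(start, end))

def route(layer, scores):
    """Pick the top-k candidate experts for this layer by router score."""
    cands = candidate_ids(layer)
    ranked = sorted(cands, key=lambda i: scores[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
scores = [random.random() for _ in pool]  # stand-in for router logits
for l in range(NUM_LAYERS):
    print(l, route(l, scores))
```

With layer-local routing each router would see only 4 candidates; with adjacent-layer reuse it sees up to 8, enriching expert combinations without adding any parameters.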
Problem

Research questions and friction points this paper is trying to address.

Reusing experts across layers to overcome layer-local routing limitations
Decoupling expert dimensionality from per-layer parameter budgets
Enabling richer expert combinations without increasing total parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reuses experts across adjacent layers
Decouples expert dimensionality from layer budgets
Uses progressive scaling routing strategy
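The progressive scaling routing idea can be illustrated with a simple schedule that widens each router's candidate pool as training proceeds. This is a hypothetical linear schedule for illustration only; the paper's actual growth curve and pool sizes may differ:

```python
# Hypothetical sketch of a progressive scaling routing (PSR) schedule.
# The routable expert pool starts at the layer-local size and grows to the
# full cross-layer size over the course of training.

def psr_pool_size(step, total_steps, base, full):
    """Linearly grow the routable pool from `base` to `full` experts."""
    frac = min(step / total_steps, 1.0)
    return base + int(frac * (full - base))

# Example: grow from 8 local experts to 16 (local + reused) over 1000 steps.
sizes = [psr_pool_size(s, 1000, 8, 16) for s in (0, 250, 500, 1000, 1500)]
print(sizes)  # [8, 10, 12, 16, 16]
```

Growing the pool gradually lets routers first stabilize on local experts before exploring reused ones, rather than facing the full candidate set from step zero.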
Authors
Zheyue Tan (Aalto University)
Zhiyuan Li (Infinigence-AI)
Tao Yuan (University of California, Los Angeles)
Dong Zhou (Infinigence-AI)
Weilin Liu (University of Ottawa)
Yueqing Zhuang (Infinigence-AI)
Yadong Li (Infinigence-AI)
Guowei Niu (Infinigence-AI)
Cheng Qin (Yale University)
Zhuyu Yao (Infinigence-AI)
Congyi Liu (Infinigence-AI)
Haiyang Xu (Infinigence-AI)
Boxun Li (Infinigence-AI)
Guohao Dai (Shanghai Jiao Tong University)
Bo Zhao (Aalto University)
Yu Wang (Tsinghua University)