Autonomy-of-Experts Models

📅 2025-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In conventional Mixture-of-Experts (MoE) models, the decoupling of routers and experts leads to suboptimal token-expert assignment. This work proposes an autonomous expert-selection mechanism: it eliminates the centralized router and instead lets each expert assess its own suitability for a given token via the norm of its internal activations, then ranks experts dynamically to determine the top-k that complete the forward pass. Coupled with low-rank weight factorization, the approach keeps inference sparse and computationally efficient. This introduces the novel "expert autonomy" paradigm, supporting end-to-end co-optimization of experts and selection logic. Pre-training evaluations on language models ranging from 700M to 4B parameters show that the method significantly outperforms standard MoE baselines under equivalent computational budgets, yielding consistent gains in both accuracy and generalization.
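The router-free self-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the use of a single weight matrix per expert, and the toy example are all assumptions:

```python
import numpy as np

def aoe_select(x, expert_weights, k):
    # Each expert computes its own internal activation for token x;
    # the L2 norm of that activation is its self-assessed suitability score.
    scores = np.array([np.linalg.norm(x @ W) for W in expert_weights])
    # Rank all experts by score and keep the top-k; no separate router exists.
    topk = np.argsort(scores)[::-1][:k]
    return topk, scores

# Toy check: three "experts" whose weights differ only in scale,
# so their activation norms are strictly ordered.
x = np.ones(4)
experts = [np.eye(4) * s for s in (0.5, 2.0, 1.0)]
winners, scores = aoe_select(x, experts, k=2)  # winners: experts 1 and 2
```

Because selection and execution use the same expert weights, gradients from the forward pass directly shape the selection behavior, which is the co-optimization the summary refers to.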

📝 Abstract
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
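A rough sketch of how the low-rank factorization keeps the pre-computation overhead small, per the abstract: every expert scores itself using only its small low-rank factor, and just the top-ranked experts finish the forward pass. Dimensions, the ReLU nonlinearity, and all names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, h, n_experts, k = 16, 4, 64, 8, 2

# Each expert's up-projection is factorized as W ≈ D @ U with rank r << h,
# so scoring a token only needs the small d x r factor D.
experts = [(rng.normal(size=(d, r)), rng.normal(size=(r, h)))
           for _ in range(n_experts)]

def aoe_forward(x):
    # Cheap pre-computation: every expert projects x into its r-dim subspace.
    cached = [x @ D for D, _ in experts]              # O(d*r) per expert
    scores = np.array([np.linalg.norm(z) for z in cached])
    topk = np.argsort(scores)[::-1][:k]
    # Only the top-k experts finish the forward pass, reusing the cached
    # low-rank projection; all other experts abort here.
    out = sum(np.maximum(cached[i] @ experts[i][1], 0.0) for i in topk)
    return out, topk

out, winners = aoe_forward(rng.normal(size=d))
```

Note that the cached projection is reused by the winning experts, so the scoring pass is not wasted work.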
Problem

Research questions and friction points this paper is trying to address.

Expert Selection
Mixture of Experts (MoE) Models
Learning Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert Autonomy
MoE Model Improvement
Self-Selection Mechanism
👥 Authors
Ang Lv
Renmin University of China
Language Model
Ruobing Xie
Tencent
Large Language Model · Recommender System · Natural Language Processing
Yining Qian
Southeast University, China
Songhao Wu
Renmin University of China
Xingwu Sun
Tencent
Natural Language Processing · Question Answering · Question Generation
Zhanhui Kang
Machine Learning Platform Department, Tencent
Di Wang
Machine Learning Platform Department, Tencent
Rui Yan
Renmin University of China