Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained dense models upgraded to Mixture-of-Experts (MoE) architectures often suffer from insufficient expert specialization and low-confidence, poorly discriminative routing—largely due to naive weight replication. To address this, we propose the Dirichlet-Prior Shaping Loss (DPSL), which explicitly regularizes routing probabilities via a Dirichlet prior to jointly optimize expert load balancing and task- or modality-specific specialization—without altering model architecture. Evaluated on Qwen2, Phi3, and Llama3.2 backbones, DPSL achieves state-of-the-art performance on vision-language multimodal benchmarks (e.g., VQAv2, OK-VQA, NLVR2), outperforming existing upcycling and regularization methods. It improves expert differentiation by 23.6% and yields an average accuracy gain of 4.1% across benchmarks.

📝 Abstract
Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
Problem

Research questions and friction points this paper is trying to address.

Improving expert specialization in upcycled Mixture-of-Experts models
Addressing low-confidence routing in sparse model architectures
Enhancing control over expert balance without manual intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dirichlet-Prior Shaping Loss regularizes routing distributions
Matches expert assignments to target Dirichlet prior
Enables fine-grained control over expert specialization
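The exact loss form is not spelled out on this page, but a minimal sketch of a Dirichlet-prior shaping penalty, assuming it is the negative log-likelihood of per-token routing distributions under a target Dirichlet prior (the function names and the concentration vector `alpha` are illustrative, not from the paper):

```python
import math

def dirichlet_log_density(p, alpha, eps=1e-12):
    """Log-density of Dirichlet(alpha) evaluated at a probability vector p."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(max(q, eps)) for a, q in zip(alpha, p))

def dpsl_loss(router_probs, alpha):
    """Negative mean Dirichlet log-likelihood over per-token router outputs.

    Intuition: alpha < 1 rewards peaked (specialized) routing, alpha > 1
    rewards balanced routing, and per-expert alpha values could encode
    inductive biases such as steering certain experts toward a modality.
    """
    return -sum(dirichlet_log_density(p, alpha) for p in router_probs) / len(router_probs)
```

For example, with `alpha = [0.5] * num_experts` a confident, near-one-hot routing distribution incurs a lower loss than uniform routing, which matches the paper's stated goal of sharpening low-confidence routers; flipping to `alpha > 1` reverses the preference toward load balance.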
Leyla Mirvakhabova
Qualcomm AI Research
B. Bejnordi
Qualcomm AI Research
Gaurav Kumar
Qualcomm Technologies
Hanxue Liang
University of Cambridge
Wanru Zhao
University of Cambridge
Paul N. Whatmough
Qualcomm AI Research