🤖 AI Summary
This work investigates sparse routing dynamics in Mixture-of-Experts (MoE) models in multilingual settings and identifies cross-lingual routing alignment, concentrated in intermediate layers, whose degree strongly correlates with multilingual performance. Building on this observation, we propose a lightweight inference-time routing guidance strategy: explicitly steering non-English tokens to English-activated generalist experts at critical intermediate layers to improve cross-lingual generalization. The method requires no fine-tuning and adds no parameters, yet delivers consistent 1–2% absolute improvements across two canonical NLP tasks, three distinct MoE architectures, and more than 15 languages, demonstrating architecture- and task-agnostic applicability. Our core contribution is the first identification and exploitation of cross-lingual routing alignment as an interpretable, low-overhead intervention mechanism for multilingual MoE models.
📝 Abstract
Mixture-of-Experts (MoE) architectures have become key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed relative to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1–2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override the routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside the middle layers, or those targeting multilingual-specialized experts, yield only performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
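The steering idea in the abstract can be sketched as a simple logit bias applied before the router's top-k expert selection at a middle layer. This is a minimal, hypothetical illustration, not the paper's implementation: the function names, the bias value, the toy logits, and the choice of which expert counts as "English-frequent" are all assumptions made for the example.

```python
import numpy as np

def steer_router_logits(logits, english_expert_ids, bias=1.0):
    """Add a fixed bias to the router logits of experts that are
    frequently activated for English tokens (illustrative rule;
    the paper's exact steering mechanism may differ)."""
    steered = np.array(logits, dtype=float)
    steered[english_expert_ids] += bias
    return steered

def top_k_experts(logits, k=2):
    """Return the indices of the k highest-scoring experts."""
    return set(np.argsort(logits)[::-1][:k].tolist())

# Toy router logits for one non-English token over 4 experts.
logits = np.array([0.1, 0.5, 0.3, 0.2])

# Without steering, experts 1 and 2 win the top-2 selection.
baseline = top_k_experts(logits, k=2)

# Suppose expert 3 is a middle-layer generalist frequently activated
# for English; biasing its logit pulls it into the top-2 routing set.
steered = top_k_experts(steer_router_logits(logits, [3], bias=1.0), k=2)
```

Because the intervention only perturbs logits before the existing top-k step, it adds no parameters and requires no fine-tuning, matching the inference-time, low-overhead framing above.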