MESA: Improving MoE Safety Alignment via Decentralized Expertise

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two weaknesses in the safety alignment of Mixture-of-Experts (MoE) architectures. First, safety capabilities are concentrated in a few experts, which leaves the model open to adversarial attacks that bypass them. Second, conventional alignment methods neglect functional differences among parameters and thereby compromise model utility. To mitigate these issues, the paper introduces MESA, a framework that formulates safety alignment as an optimal transport problem for the first time. By reallocating expert capacity and imposing dynamic routing constraints, MESA decentralizes safety responsibilities to the most cost-effective experts. Empirical results across multiple benchmarks show improved robustness against harmful inputs while preserving the model's overall usefulness.

📝 Abstract

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity at reduced computational cost by dynamically routing inputs to relevant experts. They also introduce a critical vulnerability, Safety Sparsity: safety capabilities concentrate in a few experts, which makes the model susceptible to adversarial bypassing. Meanwhile, conventional alignment methods adapt all parameters uniformly, ignoring their functional differences and inadvertently degrading performance. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance across a range of harmful-input benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.
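To make the optimal transport framing concrete, the following minimal sketch shows how a transport cost matrix can spread "safety mass" over experts. It is an illustration under stated assumptions, not the authors' implementation: the entropy-regularized Sinkhorn solver is a generic OT method swapped in here, and the cost values, the two safety behaviours, the uniform expert capacities, and the names `sinkhorn_plan`, `cost`, `source`, and `target` are all hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, source, target, reg=0.5, n_iters=200):
    """Entropy-regularized OT plan whose rows sum to `source` and columns to `target`."""
    kernel = np.exp(-cost / reg)            # Gibbs kernel: low cost -> large weight
    u = np.ones_like(source)
    v = np.ones_like(target)
    for _ in range(n_iters):
        u = source / (kernel @ v)           # rescale rows toward the source marginal
        v = target / (kernel.T @ u)         # rescale columns toward the target marginal
    return u[:, None] * kernel * v[None, :]

# Hypothetical toy setup: 2 safety behaviours (rows) and 4 experts (columns).
# cost[i, j] stands in for the utility penalty of assigning behaviour i to expert j.
cost = np.array([[0.1, 0.9, 0.5, 0.7],
                 [0.8, 0.2, 0.6, 0.3]])
source = np.array([0.5, 0.5])               # safety mass per behaviour (assumed)
target = np.full(4, 0.25)                   # equal capacity per expert (assumed)

plan = sinkhorn_plan(cost, source, target)
print(np.round(plan, 3))                    # plan[i, j]: share of behaviour i on expert j
```

Because the target marginal gives every expert the same capacity, no single expert can absorb all of the safety mass, which is the decentralization idea in miniature; within that constraint, each behaviour places more of its mass on the experts where its cost is lower.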

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

Safety Sparsity

Alignment

Large Language Models

Adversarial Bypassing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Safety Alignment

Optimal Transport