The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether sparsely activated Mixture-of-Experts (MoE) language models are easier to interpret than dense feed-forward networks, and what functional roles expert modules actually play. Using k-sparse probing, automated interpretation techniques, and routing-sparsity analysis, the authors show that MoE experts are more semantically coherent and functionally focused than neurons in dense FFNs. They further show systematically that experts align with neither broad domains nor simple token types; instead, experts perform fine-grained linguistic tasks, such as closing LaTeX brackets, establishing the expert as a meaningful unit for interpretability analysis. These findings indicate that MoE architectures are intrinsically interpretable at the expert level, offering a promising avenue toward transparent large language models.
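The routing-sparsity analysis mentioned above rests on top-k expert routing: for each token, a router scores all experts and only the k highest-scoring ones are activated, with their gate weights renormalized. A minimal sketch of that mechanism, assuming a plain softmax-over-top-k router (function and variable names are illustrative, not from the paper's code):

```python
# Illustrative top-k MoE routing sketch; not the paper's implementation.
import numpy as np

def topk_route(router_logits, k):
    """Return, per token, the indices of the k chosen experts and their
    renormalized gate weights (softmax restricted to the top k logits)."""
    top = np.argsort(router_logits, axis=-1)[:, -k:]        # k experts per token
    gates = np.take_along_axis(router_logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)              # weights sum to 1
    return top, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))   # 4 tokens, 8 experts
idx, w = topk_route(logits, k=2)   # only 2 of 8 experts fire per token
```

Sparser routing (smaller k) means each expert sees a narrower slice of the token distribution, which is the pressure toward monosemanticity the summary describes.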
📝 Abstract
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
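As a rough illustration of the k-sparse probing setup described in the abstract: select the k activation dimensions most correlated with a binary concept label, then fit a probe restricted to those k dimensions; high accuracy at small k suggests the concept is carried by few, monosemantic units. The sketch below is a hypothetical minimal version using a least-squares linear probe, not the paper's exact procedure:

```python
# Hypothetical k-sparse probing sketch; names and probe choice are illustrative.
import numpy as np

def k_sparse_probe(acts, labels, k):
    """Rank dimensions by |correlation| with a binary label, keep the top k,
    and fit a least-squares linear probe on those k dimensions alone."""
    centered = acts - acts.mean(axis=0)
    corr = (centered * (labels - labels.mean())[:, None]).mean(axis=0)
    corr /= acts.std(axis=0) * labels.std() + 1e-8
    top_k = np.argsort(np.abs(corr))[-k:]                  # k most predictive dims
    X = np.column_stack([acts[:, top_k], np.ones(len(acts))])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    preds = (X @ w) > 0.5                                  # threshold the probe
    return (preds == labels).mean(), top_k

# Toy data: 200 tokens, 64 units; the concept is encoded by unit 3 alone.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200).astype(float)
acts = rng.normal(size=(200, 64))
acts[:, 3] += 2.0 * labels                                 # one monosemantic unit
acc, idx = k_sparse_probe(acts, labels, k=1)
```

In the paper's comparison, the analogous question is how quickly probe accuracy saturates as k grows for MoE expert neurons versus dense FFN neurons.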
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts, interpretability, sparsity, monosemanticity, expert specialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts, interpretability, monosemanticity, expert specialization, sparse probing
Jeremy Herbst
Department of Informatics, University of Hamburg, Hamburg, Germany

Jae Hee Lee
University of Hamburg

Stefan Wermter
Department of Informatics, University of Hamburg, Hamburg, Germany