🤖 AI Summary
Existing attribution methods struggle to characterize the dynamic router-expert interactions in sparse Mixture-of-Experts (MoE) models. Method: This paper introduces the first cross-layer attribution algorithm tailored to heterogeneous MoE architectures, uncovering an “early filtering, late refinement” knowledge-flow pattern and a foundational-refinement collaborative framework. Contributions/Results: Through attention-head–expert correlation modeling (r = 0.68), expert ablation via masking, and multi-model comparison (Qwen 1.5-MoE, OLMoE, and Mixtral-8x7B against dense baselines), the paper finds: (1) shared experts encode general-purpose representations, while routed experts specialize in domain-specific refinement; (2) shared-expert redundancy in deeper architectures substantially enhances robustness: under expert ablation, geographic-task MRR degrades by only 43% in the deep Qwen 1.5-MoE versus 76% in the shallow OLMoE; (3) per-layer inference efficiency improves by 37%. These findings establish a principle of depth-dependent robustness and yield task-sensitive architectural design guidelines.
📝 Abstract
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture the dynamic routing-expert interactions in sparse MoE architectures. To address this, we propose a cross-layer attribution algorithm and use it to analyze sparse MoE models (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense baselines (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show that MoE models achieve 37% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework: shared experts handle general tasks (e.g., entity recognition), while routed experts specialize in domain-specific processing (e.g., geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r = 0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: the deep Qwen 1.5-MoE mitigates expert failures (a 43% MRR drop on geographic tasks when its top-10 experts are blocked) through shared-expert redundancy, whereas the shallow OLMoE suffers severe degradation (a 76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader expert participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
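The expert-ablation probe described above can be illustrated with a minimal sketch: mask selected routed experts at the router (set their logits to negative infinity) and measure how much the layer's output degrades, while the shared-expert path stays active. This is a toy reconstruction under assumed details (top-2 routing, a single linear map per expert, random weights), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, router_w, experts, shared_expert, blocked=()):
    """Toy heterogeneous MoE layer: each token is routed to its top-2
    routed experts, and a shared expert is always applied. `blocked`
    lists routed-expert indices to ablate (illustrative sketch only)."""
    logits = x @ router_w                       # (tokens, n_experts)
    logits[:, list(blocked)] = -np.inf          # mask ablated experts
    top2 = np.argsort(logits, axis=1)[:, -2:]   # indices of top-2 experts
    out = x @ shared_expert                     # shared expert: always active
    for t in range(x.shape[0]):
        sel = top2[t]
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()                            # softmax over selected experts
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ experts[e])
    return out

d, n_experts, tokens = 8, 6, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d)) * 0.1
shared = rng.normal(size=(d, d)) * 0.1

base = moe_forward(x, router_w, experts, shared)
ablated = moe_forward(x, router_w, experts, shared, blocked=(0, 1, 2))
# Relative output change when half the routed experts are blocked; the
# untouched shared-expert path bounds the damage, which is the redundancy
# argument behind the 43% vs 76% robustness gap.
deg = np.linalg.norm(base - ablated) / np.linalg.norm(base)
```

In the paper's setting the degradation is scored with MRR on downstream tasks rather than an output norm, but the masking mechanism is the same: ablation never touches the shared expert, so deeper models with stronger shared representations lose less.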