🤖 AI Summary
Existing attribution methods struggle to characterize the dynamic router-expert interactions in sparse Mixture-of-Experts (MoE) models. Method: This paper introduces the first cross-layer attribution algorithm tailored to heterogeneous MoE architectures, uncovering an “early filtering, late refinement” knowledge-flow pattern and a foundational-refinement collaborative framework. Contributions/Results: Through attention-head–expert correlation modeling (r = 0.68), expert ablation via masking, and multi-model comparison (Qwen 1.5-MoE, OLMoE, and Mixtral-8x7B against dense baselines), the paper finds: (1) shared experts encode general-purpose representations, while routed experts specialize in domain-specific refinement; (2) shared-expert redundancy in deeper architectures substantially enhances robustness: under expert ablation, geographic-task MRR degrades by only 43% in the deep Qwen 1.5-MoE versus 76% in the shallow OLMoE; (3) per-layer inference efficiency improves by 37%. These findings establish a principle of depth-dependent robustness and yield task-sensitive architectural design guidelines.
📝 Abstract
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture the dynamic routing-expert interactions in sparse MoE architectures. To address this, we propose a cross-layer attribution algorithm and use it to analyze sparse MoE models (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense baselines (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show that MoE models achieve 37% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework: shared experts handle general tasks (e.g., entity recognition), while routed experts specialize in domain-specific processing (e.g., geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r = 0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: the deep Qwen 1.5-MoE mitigates expert failures (a 43% MRR drop on geographic tasks when its top-10 experts are blocked) through shared-expert redundancy, whereas the shallow OLMoE suffers severe degradation (a 76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader expert participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
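The expert-ablation probe described above can be illustrated with a minimal sketch: mask selected routed experts at the router (set their logits to negative infinity) and measure how much the layer's output degrades, while the shared-expert path stays active. This is a toy reconstruction under assumed details (top-2 routing, a single linear map per expert, random weights), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, router_w, experts, shared_expert, blocked=()):
    """Toy heterogeneous MoE layer: each token is routed to its top-2
    routed experts, and a shared expert is always applied. `blocked`
    lists routed-expert indices to ablate (illustrative sketch only)."""
    logits = x @ router_w                       # (tokens, n_experts)
    logits[:, list(blocked)] = -np.inf          # mask ablated experts
    top2 = np.argsort(logits, axis=1)[:, -2:]   # indices of top-2 experts
    out = x @ shared_expert                     # shared expert: always active
    for t in range(x.shape[0]):
        sel = top2[t]
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()                            # softmax over selected experts
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ experts[e])
    return out

d, n_experts, tokens = 8, 6, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d)) * 0.1
shared = rng.normal(size=(d, d)) * 0.1

base = moe_forward(x, router_w, experts, shared)
ablated = moe_forward(x, router_w, experts, shared, blocked=(0, 1, 2))
# Relative output change when half the routed experts are blocked; the
# untouched shared-expert path bounds the damage, which is the redundancy
# argument behind the 43% vs 76% robustness gap.
deg = np.linalg.norm(base - ablated) / np.linalg.norm(base)
```

In the paper's setting the degradation is scored with MRR on downstream tasks rather than an output norm, but the masking mechanism is the same: ablation never touches the shared expert, so deeper models with stronger shared representations lose less.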