Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing attribution methods struggle to characterize the dynamic router-expert interactions in sparse Mixture-of-Experts (MoE) models. Method: This paper introduces the first cross-level attribution algorithm tailored to heterogeneous MoE architectures, uncovering a "mid-activation, late-amplification" knowledge-flow pattern (early layers screen experts, late layers refine knowledge collaboratively) and a "basic-refinement" collaborative framework. Contributions/Results: Through attention-head–expert correlation modeling (r = 0.68), expert ablation via masking, and multi-model comparison (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B vs. dense baselines), the paper finds: (1) shared experts encode general-purpose representations, while routed experts specialize in domain-specific refinement; (2) architectural depth drives robustness: the deep Qwen 1.5-MoE limits the geographic-task MRR drop to 43% when its top-10 experts are blocked, whereas the shallow OLMoE degrades by 76%; (3) MoE models achieve 37% higher per-layer efficiency than dense baselines. These findings establish a principle of depth-dependent robustness and yield task-sensitive architectural design guidelines.

📝 Abstract
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mixtral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a "mid-activation, late-amplification" pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a "basic-refinement" framework: shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r = 0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
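The ablation described above (blocking the top-10 most-used experts and measuring the MRR drop) can be sketched in a few lines. This is not the authors' implementation; the function names and the logits-masking approach are assumptions for illustration only:

```python
import numpy as np

def mrr(ranks):
    """Mean reciprocal rank from 1-based ranks of the gold answer."""
    return float(np.mean([1.0 / r for r in ranks]))

def ablate_top_experts(router_logits, k_block):
    """Hypothetical expert-masking ablation: find the k experts the
    router selects most often and set their logits to -inf, forcing
    traffic onto the remaining (e.g. shared/redundant) experts."""
    # router_logits: (num_tokens, num_experts)
    winners = router_logits.argmax(axis=-1)
    counts = np.bincount(winners, minlength=router_logits.shape[-1])
    blocked = np.argsort(counts)[::-1][:k_block]  # most-used experts
    masked = router_logits.copy()
    masked[:, blocked] = -np.inf  # these experts can never be routed to
    return masked, blocked
```

Comparing `mrr` on model outputs before and after the mask would give the kind of 43% vs. 76% degradation figures the paper reports for deep vs. shallow architectures.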
Problem

Research questions and friction points this paper is trying to address.

Analyzing dynamic routing-expert interactions in sparse MoE models
Proposing cross-level attribution for sparse vs dense model comparison
Investigating efficiency and robustness in MoE architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-level attribution algorithm for sparse MoE
Basic-refinement framework with shared experts
Semantic-driven routing via attention-expert correlation
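The attention-expert correlation behind the last bullet (the paper's r = 0.68) is a Pearson coefficient between per-token head activity and per-token expert routing weight. A toy sketch, with hypothetical inputs (the paper does not specify how activations are aggregated):

```python
import numpy as np

def head_expert_correlation(head_act, expert_load):
    """Pearson r between one attention head's per-token activation
    strength and one expert's per-token routing weight.
    head_act, expert_load: 1-D arrays of equal length (one value per token)."""
    h = head_act - head_act.mean()
    e = expert_load - expert_load.mean()
    return float((h * e).sum() / np.sqrt((h * h).sum() * (e * e).sum()))
```

A value near 1 would indicate that tokens activating this head are routed to this expert, i.e. the semantic-driven routing the paper describes.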
Junzhuo Li
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology
Bo Wang
The Hong Kong University of Science and Technology (Guangzhou)
Xiuze Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Machine Learning, Recommendation Systems, Large Language Models
Peijie Jiang
Ant Group
Jia Liu
Ant Group
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing, Large Language Models