AI Summary
This study examines a flaw in existing MoE pruning methods, which treat observational routing statistics (utilization rates, activation norms, routing weight distributions) as if they were causal evidence of expert importance. The authors run a token-level interventional audit of expert importance across three high-redundancy Mixture-of-Experts (MoE) architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite), applying multiple-comparison correction and a per-token routing weight control to test whether these observational metrics predict causal effects. After correction, no metric predicts causal expert importance in any model, with effect sizes below Cohen's d = 0.17 across all 60 metric-layer combinations. The per-token routing weight control, which rules out insufficient power as an explanation, recovers a single Bonferroni-significant signal at OLMoE's final MoE layer (d = +0.231, p = 0.0013). The results offer an explicit counterexample to the inference from population-level observational summaries to token-level interventional claims about expert importance.
Abstract
Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance after multiple-comparison correction in any model, with effect sizes below Cohen's $d = 0.17$ across all 60 metric-layer combinations. A per-token routing weight control rules out insufficient power, recovering a single Bonferroni-significant signal at OLMoE's final MoE layer ($d = +0.231$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.
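For readers unfamiliar with the two statistics the abstract reports, the following is a minimal sketch of how Cohen's $d$ and a Bonferroni-corrected threshold are typically computed. It is not the authors' code: the function names, the pooled-standard-deviation form of $d$, and the family size used in the usage note are assumptions made for illustration, since the abstract does not state which variant of $d$ or which test family the correction covers.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: the pooled-SD variant; the abstract does not say which form is used.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value survives Bonferroni correction over n_tests comparisons."""
    return p_value < bonferroni_threshold(alpha, n_tests)
```

As a usage example with made-up numbers, `cohens_d(loss_when_ablated, loss_baseline)` would compare per-token losses with and without a hypothetical expert ablation, and `is_significant(p, alpha=0.05, n_tests=10)` would check a p-value against a family of ten comparisons. Both the data and the family size here are illustrative and not taken from the paper.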