🤖 AI Summary
It remains unclear whether expert routing in large-scale Mixture-of-Experts (MoE) models, specifically DeepSeek-R1, moves beyond the conventional token-driven paradigm to achieve semantic-level specialization.
Method: We conduct systematic analysis via word sense disambiguation, interactive cognitive reasoning in DiscoveryWorld, expert activation pattern visualization, and statistical attribution analysis.
Contribution/Results: (1) Polysemous words consistently activate distinct expert subsets across different semantic contexts; (2) complex reasoning tasks elicit staged, modular expert collaboration; (3) we provide the first empirical evidence in an ultra-large open-source MoE model that expert activation exhibits strong semantic specificity—revealing an emergent “scale-driven semantic specialization” phenomenon. This challenges the prevailing view that MoE routing relies solely on shallow lexical features, demonstrating instead that semantic abstraction emerges robustly with scale.
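Finding (1) can be made concrete with a simple overlap metric: if a polysemous word routes to largely disjoint expert subsets in different semantic contexts, the Jaccard similarity of the activated expert IDs will be low. The sketch below is illustrative only; the expert IDs are made up and are not measurements from DeepSeek-R1.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of activated expert IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical top-k experts routed for the token "bank" in two contexts.
experts_financial = {3, 17, 42, 88, 101, 156, 200, 231}   # "open a bank account"
experts_river     = {3, 17, 55, 91, 120, 163, 205, 240}   # "sat on the river bank"

overlap = jaccard(experts_financial, experts_river)
print(f"expert overlap: {overlap:.2f}")  # → expert overlap: 0.14
```

A low overlap across senses, held consistently over many polysemous words and layers, is the kind of signal that would support semantic rather than purely token-driven routing; high overlap would suggest the router keys mainly on the surface token.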
📝 Abstract
DeepSeek-R1, the largest open-source Mixture-of-Experts (MoE) model, has demonstrated reasoning capabilities comparable to proprietary frontier models. Prior research has explored expert routing in MoE models, but findings suggest that expert selection is often token-dependent rather than semantically driven. Given DeepSeek-R1's enhanced reasoning abilities, we investigate whether its routing mechanism exhibits greater semantic specialization than previous MoE models. To explore this, we conduct two key experiments: (1) a word sense disambiguation task, where we examine expert activation patterns for words with differing senses, and (2) a cognitive reasoning analysis, where we assess DeepSeek-R1's structured thought process in the interactive task setting of DiscoveryWorld. We conclude that DeepSeek-R1's routing mechanism is more semantically aware than that of earlier MoE models and that the model engages in structured cognitive processes.