🤖 AI Summary
This work investigates the capability trade-offs of Mixture-of-Experts (MoE) architectures versus dense models on memory-intensive and reasoning-intensive tasks.
Method: We combine theoretical analysis, synthetic graph reasoning tasks, closed-book retrieval evaluations, and large-scale pretraining experiments, systematically varying the number of experts while fixing the activated parameter count.
Contribution/Results: We find that MoE memory performance improves monotonically with expert count, whereas performance on mathematical and linguistic reasoning benchmarks saturates. Crucially, we provide the first theoretical proof that MoE architectures exhibit fundamental limitations on structured reasoning tasks (e.g., graph reasoning) that comparably sized dense models solve efficiently. Our results demonstrate that MoE scaling gains are highly task-dependent: MoE significantly outperforms dense models on knowledge- or memory-intensive tasks but yields diminishing returns on reasoning-intensive ones. This work rigorously characterizes the capability boundaries of MoE architectures, offering both theoretical grounding and empirical evidence to guide model selection and architectural design.
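To make the controlled comparison concrete, below is a minimal, illustrative sketch (not the paper's code) of a top-k routed MoE feed-forward layer in PyTorch: total parameters grow roughly linearly with the number of experts, while the parameters activated per token (the top-k selected experts plus a small router) stay essentially fixed. All names and sizes here (`TopKMoE`, `d_model=512`, `top_k=2`) are assumptions for illustration only.

```python
# Minimal sketch (assumed setup, not the paper's implementation): a top-k routed
# MoE feed-forward layer. Total parameters scale with num_experts, while the
# per-token activated parameters (top_k experts + router) remain roughly fixed.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                                    # (tokens, experts)
        weights, idx = scores.softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                              # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    # Parameter accounting: experts increase, activated parameters per token stay ~constant.
    for n_experts in (8, 32, 128):
        layer = TopKMoE(d_model=512, d_hidden=2048, num_experts=n_experts, top_k=2)
        total = sum(p.numel() for p in layer.parameters())
        per_expert = sum(p.numel() for p in layer.experts[0].parameters())
        active = layer.top_k * per_expert + layer.router.weight.numel()
        print(f"{n_experts:4d} experts: total={total:,}  active/token~{active:,}")
```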
📝 Abstract
The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance trade-offs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), memorization performance consistently increases while reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same problems can easily be solved by a dense model of slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters together with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed-book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used math and natural language benchmarks. We find that increasing the number of experts helps on knowledge-intensive tasks but fails to yield the same benefits on reasoning tasks.
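The abstract does not specify which synthetic graph problems are evaluated, so the sketch below uses s-t reachability purely as a hypothetical stand-in for a structured graph reasoning task: it serializes a random graph and a query into a text prompt with a BFS-computed label, the kind of instance on which an MoE or dense transformer could be trained and evaluated. The function name `make_reachability_example` and the serialization format are assumptions, not the paper's setup.

```python
# Illustrative sketch only (assumed task, not necessarily the paper's): generate a
# toy s-t reachability instance as a text prompt plus a ground-truth label.
import random


def make_reachability_example(num_nodes: int = 8, num_edges: int = 12, seed: int = 0):
    rng = random.Random(seed)
    edges = {(rng.randrange(num_nodes), rng.randrange(num_nodes)) for _ in range(num_edges)}
    source, target = rng.sample(range(num_nodes), 2)

    # Compute the ground-truth label by graph search over directed edges.
    frontier, seen = [source], {source}
    while frontier:
        u = frontier.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                frontier.append(b)
    label = "yes" if target in seen else "no"

    # Serialize the edge list and query into a single prompt string.
    prompt = " ".join(f"{a}->{b}" for a, b in sorted(edges)) + f" | reachable {source} {target} ?"
    return prompt, label


if __name__ == "__main__":
    prompt, label = make_reachability_example()
    print(prompt, "=>", label)
```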