🤖 AI Summary
Existing token-level Mixture-of-Experts (MoE) models suffer from semantic contamination across experts, imbalanced expert load, and capacity bottlenecks due to routing entire tokens holistically. This paper proposes SliceMoE, which partitions hidden vectors into contiguous slices and routes each slice independently to experts, enabling finer-grained, more balanced model scaling. We introduce slice-level capacity loss and cross-slice dropout to encourage interpretable expert specialization in semantic versus syntactic capabilities. A lightweight shared router predicts top-k experts per slice, and fused batched GEMM operations optimize computation. Experiments on language modeling, machine translation, and text classification show that SliceMoE achieves 1.7x faster inference than dense baselines and reduces perplexity by 12-18% compared to parameter-matched token-level MoE, while significantly improving expert load balance.
📄 Abstract
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
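The routing scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the per-slice linear router, the `slice_moe_forward` name, and all shapes are our assumptions, and the experts are passed in as plain functions rather than fused batched GEMM kernels.

```python
import numpy as np

def slice_moe_forward(x, router_w, experts, k=2):
    """Illustrative slice-level MoE forward pass (hypothetical, not the paper's code).

    x        : (T, d) token hidden states; d is split into S contiguous slices.
    router_w : (S, d_slice, E) shared linear router weights, one matrix per slice.
    experts  : list of E functions mapping (n, d_slice) -> (n, d_slice).
    k        : number of experts each slice is routed to.
    """
    S, d_slice, E = router_w.shape
    T, d = x.shape
    assert d == S * d_slice, "hidden size must divide evenly into S slices"

    slices = x.reshape(T, S, d_slice)  # partition each token into S slices
    out = np.zeros_like(slices)
    for s in range(S):
        logits = slices[:, s] @ router_w[s]            # (T, E) router scores
        topk = np.argsort(logits, axis=1)[:, -k:]      # top-k expert ids per slice
        sel = np.take_along_axis(logits, topk, axis=1) # (T, k) selected logits
        w = np.exp(sel - sel.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)              # softmax combine weights
        for j in range(k):
            for e in range(E):
                mask = topk[:, j] == e                 # slices sent to expert e
                if mask.any():
                    out[mask, s] += w[mask, j, None] * experts[e](slices[mask, s])
    return out.reshape(T, d)
```

Because routing happens per slice, slices from many different tokens land in each expert's batch, which is the mechanism the abstract credits for smoother utilization; a real implementation would gather these slices and run one batched GEMM per expert instead of the Python loops above.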