AI Summary
Existing Mixture-of-Experts (MoE) models allocate fixed computational budgets per sample, limiting adaptive control over model width and depth. This work proposes a dynamic stacking MoE architecture that constructs an incrementally refined computation graph via variable-length expert sequences, enabling computation to scale adaptively across iterative inference steps. The core contribution is a novel RNN-inspired iterative training paradigm, coupled with a ray-tracing-inspired routing heuristic, which dynamically selects and expands experts layer-by-layer without requiring predefined expert counts or explicit load-balancing constraints. Experimental results demonstrate that the method maintains or improves prediction accuracy while reducing training epochs by 10%-40%, significantly enhancing both training efficiency and architectural flexibility.
Abstract
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with comparable or higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at https://github.com/nutig/RayTracing
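The idea of iteratively extending a variable-length expert sequence can be illustrated with a minimal sketch. This is not the paper's implementation: the expert pool, the norm-based routing score, and the early-exit threshold below are all illustrative assumptions standing in for the actual ray-tracing-inspired routing heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8  # feature dimension (illustrative)

def make_expert(din, dout):
    # Each expert is a small nonlinear map; weights are random stand-ins.
    W = rng.normal(scale=0.1, size=(din, dout))
    return lambda x: np.tanh(x @ W)

# Pool of candidate experts (count is arbitrary, not from the paper).
experts = [make_expert(DIM, DIM) for _ in range(6)]

def route_scores(h):
    # Toy routing heuristic: score each candidate expert by the norm of
    # its output on the current state. The paper's actual heuristic is
    # ray-tracing-inspired; this is only a placeholder.
    return np.array([np.linalg.norm(e(h)) for e in experts])

def raytraced_forward(x, max_steps=4, threshold=0.5):
    """Build a variable-depth computation path one expert at a time.

    At each step the router picks one expert and applies it; computation
    stops early when the winning score drops below a confidence
    threshold, so different inputs traverse graphs of different depth.
    """
    h = x
    path = []
    for _ in range(max_steps):
        scores = route_scores(h)
        best = int(np.argmax(scores))
        path.append(best)
        h = experts[best](h)
        if scores[best] < threshold:  # refinement has saturated
            break
    return h, path

out, path = raytraced_forward(rng.normal(size=DIM))
```

Because the loop unfolds the expert sequence step by step, training such a model resembles backpropagation through time in an RNN, with each selected expert playing the role of one unrolled step.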