Mixture of Raytraced Experts

📅 2025-07-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models allocate fixed computational budgets per sample, limiting adaptive control over model width and depth. This work proposes a dynamic stacking MoE architecture that constructs an incrementally refined computation graph via variable-length expert sequences, enabling computation to scale adaptively across iterative inference steps. The core contribution is a novel RNN-inspired iterative training paradigm, coupled with a ray-tracing–inspired routing heuristic, which dynamically selects and expands experts layer-by-layer without requiring predefined expert counts or explicit load-balancing constraints. Experimental results demonstrate that the method maintains or improves prediction accuracy while reducing training epochs by 10%–40%, significantly enhancing both training efficiency and architectural flexibility.
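The variable-depth routing described above can be sketched in a toy form: a router picks the next expert from the current hidden state, and a halting score decides when the sequence is long enough. All names (`router`, `halt_w`), shapes, and the greedy-plus-threshold rule here are illustrative assumptions, not the paper's actual routing heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: each "expert" is a small linear map.
DIM, N_EXPERTS, MAX_STEPS = 8, 4, 6
experts = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_EXPERTS)]
router = rng.normal(size=(DIM, N_EXPERTS))  # scores candidate experts
halt_w = rng.normal(size=DIM)               # scores "stop refining here"

def forward(x, threshold=0.5):
    """Greedily extend a variable-length expert sequence.

    Each sample gets its own computation graph of variable depth,
    in the spirit of the paper's dynamic stacking.
    """
    h, path = x, []
    for _ in range(MAX_STEPS):
        k = int(np.argmax(h @ router))   # pick next expert (greedy, assumed)
        h = np.tanh(experts[k] @ h)      # apply it
        path.append(k)
        if 1.0 / (1.0 + np.exp(-(halt_w @ h))) > threshold:
            break                        # halting score says: enough compute
    return h, path

h, path = forward(rng.normal(size=DIM))
print(path)
```

Different inputs can take different expert sequences of different lengths, which is what makes the per-sample computational budget adaptive.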

๐Ÿ“ Abstract
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture which can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with a comparable/higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at https://github.com/nutig/RayTracing
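The abstract's RNN-like training view, where the expert sequence is unfolded and a prediction is read out at every cycle, can be sketched as follows. The softmax sampling over candidate experts, the scalar readout, and all shapes are assumptions made for the sketch; the paper's actual sampling and loss are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unrolled view: at each step a candidate expert is sampled from a
# softmax over router scores, and a prediction is read out, so each
# extra cycle can refine the output (shapes/rules are illustrative).
DIM, N_EXPERTS, STEPS = 8, 4, 5
experts = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_EXPERTS)]
router = rng.normal(size=(DIM, N_EXPERTS))
readout = rng.normal(size=DIM)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unroll(x):
    """Sample an expert sequence and return one prediction per cycle."""
    h, preds = x, []
    for _ in range(STEPS):
        k = rng.choice(N_EXPERTS, p=softmax(h @ router))  # sampled candidate
        h = np.tanh(experts[k] @ h)
        preds.append(float(readout @ h))  # a readout after every cycle
    return preds  # a loss on every element would train all unrolled steps

preds = unroll(rng.normal(size=DIM))
print(len(preds))
```

Attaching a loss to every element of `preds` and backpropagating through the unrolled sequence mirrors how RNNs are trained through time, which is the analogy the abstract draws.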
Problem

Research questions and friction points this paper is trying to address.

How can experts be selected dynamically to produce computation graphs of variable width and depth?
Can prediction accuracy improve as computation cycles through a sequence of experts?
Can training cost be reduced without explicit load-balancing mechanisms?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic expert selection producing variable-width, variable-depth computation graphs
Iterative sampling from candidate experts, unrolled RNN-style during training
No load-balancing mechanisms needed, with training epochs reduced by 10%–40%