AI Summary
Existing Mixture-of-Experts (MoE) models allocate fixed computational budgets per sample, limiting adaptive control over model width and depth. This work proposes a dynamic stacking MoE architecture that constructs an incrementally refined computation graph via variable-length expert sequences, enabling computation to scale adaptively across iterative inference steps. The core contribution is a novel RNN-inspired iterative training paradigm, coupled with a ray-tracing-inspired routing heuristic, which dynamically selects and expands experts layer-by-layer without requiring predefined expert counts or explicit load-balancing constraints. Experimental results demonstrate that the method maintains or improves prediction accuracy while reducing training epochs by 10%-40%, significantly enhancing both training efficiency and architectural flexibility.
Abstract
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture that can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with comparable or higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at https://github.com/nutig/RayTracing
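The idea of iteratively extending a variable-length expert sequence can be illustrated with a minimal sketch. This is not the paper's implementation: the expert pool, the norm-based routing score, and the early-exit threshold below are all illustrative assumptions standing in for the actual ray-tracing-inspired routing heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8  # feature dimension (illustrative)

def make_expert(din, dout):
    # Each expert is a small nonlinear map; weights are random stand-ins.
    W = rng.normal(scale=0.1, size=(din, dout))
    return lambda x: np.tanh(x @ W)

# Pool of candidate experts (count is arbitrary, not from the paper).
experts = [make_expert(DIM, DIM) for _ in range(6)]

def route_scores(h):
    # Toy routing heuristic: score each candidate expert by the norm of
    # its output on the current state. The paper's actual heuristic is
    # ray-tracing-inspired; this is only a placeholder.
    return np.array([np.linalg.norm(e(h)) for e in experts])

def raytraced_forward(x, max_steps=4, threshold=0.5):
    """Build a variable-depth computation path one expert at a time.

    At each step the router picks one expert and applies it; computation
    stops early when the winning score drops below a confidence
    threshold, so different inputs traverse graphs of different depth.
    """
    h = x
    path = []
    for _ in range(max_steps):
        scores = route_scores(h)
        best = int(np.argmax(scores))
        path.append(best)
        h = experts[best](h)
        if scores[best] < threshold:  # refinement has saturated
            break
    return h, path

out, path = raytraced_forward(rng.normal(size=DIM))
```

Because the loop unfolds the expert sequence step by step, training such a model resembles backpropagation through time in an RNN, with each selected expert playing the role of one unrolled step.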