DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a bottleneck in scaling Mixture-of-Experts (MoE) models: fine-grained experts enlarge the space of expert combinations, but they also impose substantial routing overhead. The authors propose DAG-MoE, a sparse MoE framework that replaces the conventional weighted summation of expert outputs with a learned directed acyclic graph (DAG) structure over the selected experts, without modifying the experts or the router. They show theoretically that this structural aggregation expands the expert-combination space and may enable multi-step reasoning within a single MoE layer. In experiments under standard language modeling settings, DAG-MoE consistently outperforms traditional MoE baselines in both pretraining and fine-tuning.

📝 Abstract

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling: how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and may enable multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
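
To make the contrast between weighted-summation and structural aggregation concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not specify how the lightweight module wires the selected experts, so the function `dag_aggregate`, the `edges` matrix, and the toy experts below are all hypothetical. The sketch assumes one plausible wiring, in which each selected expert may additionally receive the outputs of earlier experts through weighted edges. With every edge weight at zero, it reduces to the standard weighted sum.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def weighted_sum(x: Vector, experts: Sequence[Expert], gates: Sequence[float]) -> Vector:
    """Standard MoE aggregation: y = sum_i g_i * E_i(x)."""
    out = [0.0] * len(x)
    for expert, gate in zip(experts, gates):
        h = expert(x)
        out = [o + gate * v for o, v in zip(out, h)]
    return out


def dag_aggregate(
    x: Vector,
    experts: Sequence[Expert],
    gates: Sequence[float],
    edges: Sequence[Sequence[float]],
) -> Vector:
    """Hypothetical structural aggregation over the selected experts.

    edges[i][j] (only j < i is read) weights the edge that feeds expert j's
    output into expert i's input. Reading only j < i keeps the graph acyclic.
    This wiring is an assumption for illustration, not the paper's module.
    """
    outputs: List[Vector] = []
    for i, expert in enumerate(experts):
        inp = list(x)
        for j in range(i):
            w = edges[i][j]
            inp = [a + w * b for a, b in zip(inp, outputs[j])]
        outputs.append(expert(inp))
    out = [0.0] * len(x)
    for gate, h in zip(gates, outputs):
        out = [o + gate * v for o, v in zip(out, h)]
    return out


# Toy experts standing in for the feed-forward experts of a real MoE layer.
def double(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def add_one(v: Vector) -> Vector:
    return [a + 1.0 for a in v]


x = [1.0, 2.0]
gates = [0.5, 0.5]
flat = weighted_sum(x, [double, add_one], gates)
no_edges = dag_aggregate(x, [double, add_one], gates, [[0.0, 0.0], [0.0, 0.0]])
chained = dag_aggregate(x, [double, add_one], gates, [[0.0, 0.0], [1.0, 0.0]])
```

In this toy setup, `no_edges` matches `flat`, while `chained` differs because the second expert also sees the first expert's output. That composition of experts inside one layer is the kind of multi-step computation the abstract alludes to, using the same experts and gate values.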

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

expert aggregation

scalability

routing overhead

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural Aggregation

Mixture-of-Experts

DAG-MoE