DOT-MoE: Differentiable Optimal Transport for MoEfication

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently and stably converting pretrained dense large language models into sparse Mixture-of-Experts (MoE) architectures, avoiding the high cost and instability of training MoE models from scratch. It introduces a MoEfication method based on Differentiable Optimal Transport (DOT) that formulates expert decomposition in feed-forward networks as a neuron assignment problem with strict capacity constraints. Balanced assignment is enforced via Sinkhorn-Knopp iterations, and the discrete neuron allocation is optimized end-to-end, jointly with token routing, using a Straight-Through Estimator (STE). Experiments across multiple architectures and benchmarks show that the approach substantially outperforms baselines such as structured pruning and heuristic clustering, retaining 90% of the original model’s performance while reducing activated parameters by 50%.
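To make the balanced-assignment idea concrete, here is a minimal sketch of Sinkhorn-Knopp scaling for a soft neuron-to-expert plan. It is an illustration under stated assumptions, not the paper's implementation: the cost matrix, the function name `sinkhorn_plan`, the temperature `epsilon`, and the iteration count are all hypothetical, and the learned cost and the STE-based discretization are not shown.

```python
import math

def sinkhorn_plan(cost, capacity, epsilon=0.5, n_iters=200):
    """Soft neuron-to-expert transport plan via Sinkhorn-Knopp scaling.

    cost[i][j] is a hypothetical cost of placing neuron i in expert j.
    Each neuron (row) carries mass 1 and each expert (column) must
    receive exactly `capacity` units of mass.
    """
    n, m = len(cost), len(cost[0])
    if abs(n - m * capacity) > 1e-9:
        raise ValueError("total neuron mass must equal total expert capacity")

    # Gibbs kernel: lower cost -> larger affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n  # row (neuron) scaling factors
    v = [1.0] * m  # column (expert) scaling factors

    for _ in range(n_iters):
        # Rescale rows so every neuron distributes a total mass of 1.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so every expert receives `capacity` mass.
        v = [capacity / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical toy problem: 4 neurons, 2 experts, 2 neurons per expert.
# Every neuron prefers expert 0, yet the plan still fills both experts.
toy_cost = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0], [0.3, 0.7]]
toy_plan = sinkhorn_plan(toy_cost, capacity=2.0)
for row in toy_plan:
    print([round(p, 3) for p in row])
```

Each iteration consists only of multiplications, divisions, and sums, so the same loop remains differentiable when written in an autodiff framework. A smaller `epsilon` gives a sharper, closer-to-discrete plan but typically needs more iterations to converge.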

📝 Abstract

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of relying on static heuristics, we model neuron assignment as a balanced transport problem, using differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
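As a rough illustration of how partitioning an FFN into experts decouples stored parameters from per-token cost, the sketch below counts active FFN parameters for one hypothetical configuration. The dimensions, the expert count, the top-k value, and the helper names are assumptions chosen for the arithmetic; the abstract does not state the configuration, nor whether the 50% figure counts FFN parameters only or the whole model.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool = False) -> int:
    """Weight count of one dense FFN block (biases ignored).

    Counts the up and down projections, plus a gate projection
    when the FFN is gated.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

def active_ffn_params(d_model: int, d_ff: int, num_experts: int,
                      top_k: int, gated: bool = False) -> int:
    """Per-token active weights after splitting the FFN into equal experts.

    Assumes d_ff divides evenly across experts and that each token is
    routed to `top_k` experts through a linear router of shape
    (d_model, num_experts).
    """
    if d_ff % num_experts != 0:
        raise ValueError("d_ff must be divisible by num_experts")
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must lie in [1, num_experts]")
    neurons_per_expert = d_ff // num_experts
    expert_weights = ffn_params(d_model, neurons_per_expert, gated)
    router_weights = d_model * num_experts
    return top_k * expert_weights + router_weights

# Hypothetical configuration: 8 experts, 4 active per token.
dense = ffn_params(d_model=1024, d_ff=4096)
active = active_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=4)
print(f"dense FFN weights:  {dense}")
print(f"active FFN weights: {active}")
print(f"active fraction:    {active / dense:.4f}")
```

Under these assumptions the active fraction is close to `top_k / num_experts`, with the router adding only a small overhead, while the total number of stored FFN weights stays the same as in the dense model.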

Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts

Model Conversion

Inference Efficiency

Large Language Models

Sparse Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable Optimal Transport

Mixture of Experts

Neuron Assignment