MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mixture of Experts (MoE) models face system-level inference challenges, including load imbalance across experts and high routing overhead, that hinder efficient deployment on modern hardware. Method: This paper introduces MoE-Inference-Bench, a comprehensive benchmarking framework tailored to MoE models, systematically evaluating the inference performance of large language and vision models on Nvidia H100 GPUs. It quantifies the impact of batch size, sequence length, and expert architecture on throughput, and integrates pruning, fused MoE operators, speculative decoding, quantization, and parallelization strategies. Empirical analysis spans prominent MoE families, including Mixtral, DeepSeek, OLMoE, and Qwen. Contribution/Results: The study identifies critical hyperparameter dependencies, determines effective configuration combinations, and demonstrates substantial improvements in inference throughput and GPU utilization. It provides reproducible empirical evidence and actionable optimization guidelines for efficient MoE model deployment.
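The load imbalance the summary refers to arises from top-k routing: a learned gate sends each token to its k highest-scoring experts, so some experts can receive far more tokens than others. A minimal sketch of this mechanism, using random router logits purely for illustration (the function names and parameters here are hypothetical, not from the paper's code):

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Assign each token to its top_k experts by router probability
    and return per-expert token counts (a proxy for expert load)."""
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        # Stand-in for a learned gating network's logits.
        logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        probs = softmax(logits)
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            counts[e] += 1
    return counts

counts = route_tokens(num_tokens=4096, num_experts=8, top_k=2)
# Ratio of the busiest expert's load to the mean load; 1.0 = perfectly balanced.
imbalance = max(counts) / (sum(counts) / len(counts))
```

With a real trained router the logits are input-dependent, so the imbalance can be much more skewed than with random logits, which is why balanced routing and fused MoE kernels matter for throughput.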

📝 Abstract
Mixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) by achieving massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the additional routing computational overhead. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study to evaluate MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters such as FFN dimensions and number of experts on throughput. We evaluate several optimization techniques on Nvidia H100 GPUs, including pruning, Fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation includes MoEs from the Mixtral, DeepSeek, OLMoE and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.
Problem

Research questions and friction points this paper is trying to address.

Systematically evaluating hardware acceleration techniques for MoE inference
Analyzing the impact of batch size, sequence length, and expert hyperparameters on MoE throughput
Assessing optimization techniques for efficient MoE deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of hardware acceleration techniques on Nvidia H100 GPUs
Analysis of batch size and sequence length impact on throughput
Evaluation of pruning, quantization, and related optimization techniques