MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
MoE model inference on consumer-grade GPUs is bottlenecked by limited CPU-GPU bandwidth, while existing prefetching techniques suffer from low efficiency under fine-grained expert partitioning and incur substantial training overhead. To address these challenges, this paper proposes MoBiLE, a plug-and-play hybrid expert inference framework. Its core innovations are: (1) a size-aware expert collaboration mechanism that routes non-critical tokens to lightweight small experts for acceleration while reserving large experts for critical tokens to preserve accuracy; and (2) an integrated strategy combining expert offloading, dynamic prefetching, and fine-grained scheduling to optimize memory switching and data transfer. Evaluated on four mainstream MoE models, MoBiLE achieves a 1.60×–1.72× inference speedup with negligible accuracy degradation, significantly outperforming state-of-the-art offloading approaches.

πŸ“ Abstract
Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
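The fallback-and-prefetch idea in the abstract can be illustrated with a toy sketch: while the little experts compute, the big experts are fetched in a background thread, and the transfer is waited on only if the token falls back to the big path. All class and function names here are hypothetical; a real implementation would overlap asynchronous host-to-device copies with GPU compute rather than use Python threads.

```python
import threading

class BigExpertPrefetcher:
    """Toy model of overlapping big-expert transfer with little-expert compute."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn  # e.g. copies expert weights from CPU DRAM to GPU HBM
        self._result = None
        self._thread = None

    def start(self, expert_ids):
        # Kick off the transfer without blocking the little-expert path.
        self._thread = threading.Thread(target=self._fetch, args=(expert_ids,))
        self._thread.start()

    def _fetch(self, expert_ids):
        self._result = self.fetch_fn(expert_ids)

    def result(self):
        # Called only on fallback: block until the big experts have arrived.
        self._thread.join()
        return self._result
```

If the token stays on the little path, `result()` is never called and the transfer cost is fully hidden behind compute.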
Problem

Research questions and friction points this paper is trying to address.

Optimizing MoE inference speed on consumer GPUs
Reducing CPU-GPU bandwidth bottleneck in expert offloading
Maintaining model accuracy while accelerating sparse activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of big-little experts halves expert count for unimportant tokens
Fallback and prefetching mechanism switches between little and big experts
Plug-and-play framework requires no trained expert predictor
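The big-little selection described above can be sketched as follows. The importance criterion used here (peak routing probability against a threshold) and all names are illustrative assumptions, not MoBiLE's actual implementation.

```python
import numpy as np

def route_token(router_logits, top_k=8, importance_threshold=0.5):
    """Choose experts for one token under a big-little policy.

    A token whose peak routing probability is high is treated as critical
    and keeps the full top-k experts (the big path); otherwise it is
    served by half as many experts (the little path).
    """
    # Numerically stable softmax over the router logits.
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    critical = probs.max() >= importance_threshold
    k = top_k if critical else top_k // 2
    chosen = np.argsort(probs)[-k:][::-1]  # indices of the k largest scores
    return chosen, critical
```

A confidently routed token (one expert dominating the distribution) takes the big path with all top-k experts, while a flat distribution drops to the little path with top-k/2, cutting the expert weights that must cross the CPU-GPU link.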
Yushu Zhao
BNRist, Tsinghua University
Yubin Qin
BNRist, Tsinghua University
Yang Wang
BNRist, Tsinghua University
Xiaolong Yang
BNRist, Tsinghua University
Huiming Han
BNRist, Tsinghua University
Shaojun Wei
Professor, Tsinghua University
Yang Hu
BNRist, Tsinghua University
Shouyi Yin
Tsinghua University