Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high GPU memory consumption and substantial expert loading latency in MoE large-model inference, this paper proposes MoEpic, a system built on a novel expert split mechanism: each expert is vertically divided into a top and a bottom segment. MoEpic caches only the top segments of hot experts, so more experts fit within a limited VRAM budget and the cache hit rate improves. During each layer's inference, it predicts and prefetches the experts activated in the next layer; since the top segments of cached experts are exempt from fetching, loading time shrinks and transfers overlap efficiently with computation. Because performance hinges on each layer's VRAM budget and expert split ratio, MoEpic configures the cache adaptively with a divide-and-conquer algorithm based on fixed-point iteration. Evaluated against baseline approaches on popular MoE LLMs, MoEpic saves roughly half of the GPU cost while lowering end-to-end inference latency by 37.51%–65.73%, without sacrificing model accuracy. Its core contribution is combining vertical expert partitioning with segment-level caching of hot experts, alleviating the CPU–GPU I/O bottleneck in offloaded MoE inference.

📝 Abstract
Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer's VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.
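The split-and-cache idea from the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the class and method names (`SplitExpertCache`, `fetch`, and so on) are made up, and real systems would move GPU tensors rather than NumPy arrays. The point it shows is that for a hot expert whose top segment already sits in VRAM, only the bottom segment must cross the CPU–GPU link.

```python
# Hedged sketch of MoEpic-style split-expert caching (all names illustrative).
# Each expert's weight matrix is split row-wise into a "top" and "bottom"
# segment; only top segments of hot experts are kept in VRAM, so a cache hit
# needs to transfer just the bottom segment from CPU RAM.

import numpy as np

class SplitExpertCache:
    def __init__(self, split_ratio: float):
        self.split_ratio = split_ratio   # fraction of rows cached in VRAM
        self.vram_top = {}               # expert_id -> top segment (in VRAM)
        self.ram = {}                    # expert_id -> full weights (in RAM)

    def add_expert(self, eid: int, weights: np.ndarray, hot: bool):
        self.ram[eid] = weights
        if hot:                          # cache only hot experts' top rows
            cut = int(len(weights) * self.split_ratio)
            self.vram_top[eid] = weights[:cut]

    def fetch(self, eid: int):
        """Return (full weights, number of rows actually transferred)."""
        full = self.ram[eid]
        top = self.vram_top.get(eid)
        if top is None:                  # cache miss: load everything
            return full, len(full)
        bottom = full[len(top):]         # only the bottom segment moves
        return np.vstack([top, bottom]), len(bottom)

cache = SplitExpertCache(split_ratio=0.5)
w = np.arange(8.0).reshape(4, 2)         # a toy 4-row expert weight matrix
cache.add_expert(0, w, hot=True)
cache.add_expert(1, w, hot=False)
full0, moved0 = cache.fetch(0)           # hit: 2 of 4 rows transferred
full1, moved1 = cache.fetch(1)           # miss: all 4 rows transferred
```

Under this toy model, halving the per-expert VRAM footprint lets twice as many experts be (partially) cached for the same budget, which is the source of the improved hit rate.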
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU memory demands for Mixture-of-Experts inference
Improving cache hit rate and reducing expert loading latency
Optimizing VRAM budget allocation and expert split configuration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vertically splits experts into top and bottom segments
Caches top segments of hot experts in VRAM
Uses adaptive cache configuration with fixed-point iteration
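The third bullet can be illustrated with a toy fixed-point iteration, the numerical tool the paper's divide-and-conquer configuration algorithm builds on. The update rule below is a made-up stand-in: the paper derives its own from a per-layer latency model that is not reproduced here, and the coupling between split ratio and hit rate is only an assumption for demonstration.

```python
# Toy fixed-point iteration for choosing an expert split ratio r.
# The update rule is a stand-in, not the paper's actual formula.

def fixed_point(g, x0, tol=1e-9, max_iter=1000):
    """Iterate x <- g(x) until successive values agree within tol."""
    x = x0
    for _ in range(max_iter):
        nxt = g(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("fixed-point iteration did not converge")

# Assumed coupling: with VRAM budget B and expert size S, caching the top
# fraction r of each cached expert fits n = B / (r * S) experts, giving a
# hit rate h = min(1, n / N) under uniform routing over N experts. The toy
# update shrinks r (caching thinner tops of more experts) when hits are
# plentiful, and grows it when they are scarce.
B, S, N = 3.0, 2.0, 4

def update(r):
    hit_rate = min(1.0, B / (r * S * N))
    return 1.0 - 0.5 * hit_rate

r_star = fixed_point(update, 0.5)   # converges to r = 0.75 for these values
```

In the paper, an outer divide-and-conquer loop additionally splits the total VRAM budget across layers; the iteration above only sketches the inner per-layer step under the stated toy assumptions.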