🤖 AI Summary
Deploying Mixture-of-Experts (MoE) models on memory-constrained devices runs into a severe GPU memory bottleneck: experts that do not fit on the GPU trigger frequent CPU–GPU weight transfers that drastically degrade inference speed. To address this, we propose DAOP, a heterogeneous collaborative execution framework for efficient MoE inference. Our approach features: (1) a novel sequence-level, activation-aware dynamic expert offloading and allocation strategy; (2) a predictive CPU pre-computation mechanism that proactively executes pending experts; and (3) a precision-preserving graceful degradation scheme. By adapting to the available expert cache ratio and orchestrating cross-device scheduling, our framework achieves up to 8.20× speedup over conventional caching/prefetching baselines and 1.35× over state-of-the-art offloading methods across multiple datasets, without any accuracy loss.
📝 Abstract
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means that not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory that often negate the GPU's speed advantage. To address this, we present DAOP, an on-device MoE inference engine that optimizes parallel GPU–CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on the CPU to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20× and offloading techniques by 1.35× while maintaining accuracy.
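
To make the scheduling idea concrete, below is a minimal, illustrative sketch of per-sequence expert placement and the CPU pre-computation trade-off. It is not DAOP's implementation: the function names (`allocate_experts`, `place_expert`), the cost model, and all numbers are hypothetical assumptions, shown only to clarify how activation-aware caching and CPU pre-computation can interact.

```python
# Illustrative sketch only (not DAOP's actual code): decide, per sequence,
# which experts to keep resident in the limited GPU cache, and whether a
# non-resident expert is cheaper to pre-compute on the CPU than to copy
# to the GPU and run there.
from dataclasses import dataclass


@dataclass
class Costs:
    transfer_ms: float     # time to copy one expert's weights CPU -> GPU (assumed)
    gpu_compute_ms: float  # time to run one expert on the GPU (assumed)
    cpu_compute_ms: float  # time to run one expert on the CPU (assumed)


def allocate_experts(activation_counts, gpu_slots):
    """Pin the experts activated most often for this sequence into the
    GPU cache; the remaining experts stay in CPU memory."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    return set(ranked[:gpu_slots])


def place_expert(expert_id, gpu_resident, costs):
    """Choose where to execute a predicted expert for the next token."""
    if expert_id in gpu_resident:
        return "gpu"  # already cached on the GPU: run in place
    # Non-resident expert: pre-computing on the CPU hides the transfer
    # latency whenever CPU execution beats copy-then-GPU execution.
    if costs.cpu_compute_ms < costs.transfer_ms + costs.gpu_compute_ms:
        return "cpu"
    return "transfer_then_gpu"


if __name__ == "__main__":
    # Toy activation profile for one sequence: expert id -> hit count.
    counts = {0: 12, 1: 3, 2: 9, 3: 1, 4: 7, 5: 2}
    resident = allocate_experts(counts, gpu_slots=3)  # -> {0, 2, 4}
    costs = Costs(transfer_ms=4.0, gpu_compute_ms=0.5, cpu_compute_ms=2.5)
    for expert_id in counts:
        print(expert_id, place_expert(expert_id, resident, costs))
```

The trade-off this sketch illustrates is that a non-resident expert is worth executing on the CPU whenever that is faster than paying the CPU-to-GPU transfer cost plus GPU compute, which is precisely where transfer-bound offloading engines would otherwise stall.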