Efficient Edge LLM Deployment via Hessian-Aware Quantization and CPU–GPU Collaborative Inference

📅 2025-08-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deploying Mixture-of-Experts (MoE) large language models on edge devices faces two key challenges: (1) severe accuracy degradation in low-bit quantization due to outlier-heavy activation distributions, and (2) difficulty in balancing latency and throughput during expert offloading under stringent memory constraints. To address these, this work proposes a joint optimization framework integrating Hessian-aware quantization and CPU–GPU collaborative inference. Specifically, it introduces smooth Hessian-guided joint 8-bit quantization of activations and weights to suppress the impact of outliers, and designs a dynamic expert offloading scheduler based on per-expert activation path statistics, enabling fine-grained, expert-level CPU–GPU collaboration. Evaluated with OPT and Mixtral 8×7B on the Wikitext2 and C4 benchmarks, the method achieves near-full-precision accuracy after quantization, reduces GPU memory usage by about 60%, and significantly lowers end-to-end inference latency, demonstrating strong efficacy for edge deployment.
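The core quantization idea can be illustrated with a minimal sketch: migrate per-channel activation outliers into the weights via a smoothing factor, then apply symmetric 8-bit quantization to both tensors. This is a simplified stand-in for the paper's smoothed Hessian-guided method; the Hessian-weighted objective is omitted, and the `alpha` balancing parameter and all function names are illustrative assumptions.

```python
import numpy as np

def smooth_scale(activations, weights, alpha=0.5):
    """Migrate activation outliers into the weights (illustrative).

    A per-input-channel factor s equalizes the dynamic ranges of
    activations and weights so both quantize well at 8 bits.
    """
    act_max = np.abs(activations).max(axis=0)   # per input channel
    w_max = np.abs(weights).max(axis=1)         # per input channel of W
    s = np.maximum(act_max ** alpha, 1e-8) / np.maximum(w_max ** (1 - alpha), 1e-8)
    return activations / s, weights * s[:, None]

def quantize_int8(x):
    """Symmetric per-tensor 8-bit quantization; returns (int8 tensor, scale)."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy example: one outlier-heavy activation channel, as in MoE activations.
np.random.seed(0)
X = np.random.randn(4, 8)
X[:, 0] *= 50.0                       # inject an outlier channel
W = np.random.randn(8, 16)

Xs, Ws = smooth_scale(X, W)
qx, sx = quantize_int8(Xs)
qw, sw = quantize_int8(Ws)

# Integer matmul, then dequantize with the combined scale.
Y_approx = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
err = np.abs(Y_approx - X @ W).max()
```

With `alpha = 0.5` the smoothing factor makes the per-channel maxima of the scaled activations and weights exactly equal, which is what keeps the single per-tensor scale from being dominated by the outlier channel.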

📝 Abstract
With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) the presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed Hessian matrix quantization, we achieve joint 8-bit quantization of activations and weights, which significantly alleviates the accuracy loss caused by outliers while ensuring efficient implementation on mainstream hardware. Second, we design an expert-level collaborative offloading and inference mechanism, which, combined with expert activation path statistics, enables efficient deployment and scheduling of expert modules between CPU and GPU, greatly reducing memory footprint and inference latency. Extensive experiments validate the effectiveness of our method on mainstream large models such as the OPT series and Mixtral 8×7B: on datasets such as Wikitext2 and C4, the inference accuracy of the low-bit quantized model approaches that of the full-precision model, while GPU memory usage is reduced by about 60% and inference latency is significantly reduced.
Problem

Research questions and friction points this paper is trying to address.

Quantization accuracy loss due to activation outliers
Balancing latency and throughput in MoE deployment
Efficient CPU-GPU collaboration for edge LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian-Aware Quantization for 8-bit precision
CPU-GPU collaborative expert offloading
Smoothed Hessian matrix reduces outlier impact
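The expert-offloading idea above can be sketched as a simple placement policy: track how often the router activates each expert, keep the hottest experts resident in GPU memory, and serve the cold ones from CPU. This is a toy illustration under stated assumptions; the class, its capacity model, and the greedy top-k rule are hypothetical simplifications of the paper's dynamic scheduler.

```python
from collections import Counter

class ExpertScheduler:
    """Toy expert-level placement policy (illustrative, not the paper's).

    Keeps the most frequently activated experts on GPU and offloads
    the rest to CPU, based on accumulated activation-path statistics.
    """

    def __init__(self, num_experts, gpu_capacity):
        self.counts = Counter()            # per-expert activation counts
        self.num_experts = num_experts
        self.gpu_capacity = gpu_capacity   # experts that fit in GPU memory

    def record(self, routed_experts):
        # Update statistics after each MoE layer forward pass.
        self.counts.update(routed_experts)

    def placement(self):
        # Greedy rule: the top-k hottest experts live on GPU.
        hot = {e for e, _ in self.counts.most_common(self.gpu_capacity)}
        return {e: ("gpu" if e in hot else "cpu")
                for e in range(self.num_experts)}

# Usage: simulate a few top-2 routing decisions, then read the plan.
sched = ExpertScheduler(num_experts=8, gpu_capacity=3)
for routed in [[0, 2], [0, 5], [2, 0], [2, 5], [1, 0]]:
    sched.record(routed)
plan = sched.placement()   # experts 0, 2, 5 land on GPU here
```

A real scheduler would also weigh transfer cost and recency rather than raw counts, but the statistics-driven hot/cold split is the essence of expert-level CPU-GPU collaboration.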