Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory consumption and latency of multi-mode Chain-of-Thought (CoT) inference (specifically *slow_think*, *auto_think*, and *no_think*) on openPangu-Embedded-1B/7B models deployed on the Atlas A2 Ascend NPU, this work proposes the first Ascend-native unified INT8/W4A8 post-training quantization framework. The method jointly optimizes computation and storage by combining CoT-mode-aware calibration, CANN operator customization, and weight-activation quantization co-design, while preserving inference fidelity. Experimental results show that INT8 quantization retains over 90% of FP16 accuracy on the HumanEval and MBPP benchmarks with a 1.5× improvement in prefill throughput, while W4A8 quantization substantially reduces the NPU memory footprint. This is the first work to validate efficient multi-mode CoT inference on Ascend NPUs, establishing a practical pathway for resource-constrained large language model execution on Ascend hardware.
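
As a rough illustration of what the W8A8 (INT8 weight and activation) setting does to a single linear layer, the sketch below maps floating-point tensors to INT8 with per-tensor absmax scales, accumulates the matmul in INT32, and dequantizes the result. This is a minimal NumPy sketch of the general technique, not the paper's Ascend/CANN implementation; the function names and the per-tensor scaling choice are assumptions.

```python
import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    """Map floating-point values to INT8 with a single symmetric (absmax) scale."""
    scale = max(np.abs(x).max(), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x_fp: np.ndarray, w_fp: np.ndarray) -> np.ndarray:
    """Quantize activations and weights, multiply with INT32 accumulation, then dequantize."""
    q_x, s_x = quantize_symmetric_int8(x_fp)
    q_w, s_w = quantize_symmetric_int8(w_fp)
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T   # integer accumulation
    return acc.astype(np.float32) * (s_x * s_w)            # rescale back to floating point

# Usage: the quantized output should stay close to the floating-point reference.
x = np.random.randn(4, 64).astype(np.float32)    # activations (batch, hidden)
w = np.random.randn(128, 64).astype(np.float32)  # weights (out, hidden)
print(np.abs(x @ w.T - w8a8_linear(x, w)).max())
```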

📝 Abstract
Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B, variants of the openPangu large language model, integrate three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think. While these CoT modes enhance reasoning capabilities, their generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation, conducted across all three CoT modes on code generation benchmarks (HumanEval and MBPP), demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.
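
The W4A8 setting described in the abstract keeps 8-bit activations but compresses weights to 4 bits, which is where the memory savings come from. A common way to do this is group-wise symmetric quantization, where each group of weight columns shares one scale. The sketch below is a minimal NumPy illustration under assumed choices (group size 128, symmetric range [-7, 7], no nibble packing); it is not the paper's actual weight format.

```python
import numpy as np

def quantize_weights_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Quantize each row of `w` in column groups to symmetric 4-bit integers,
    keeping one floating-point scale per group (packing into nibbles omitted)."""
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.maximum(np.abs(groups).max(axis=-1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_int4_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover a floating-point approximation of the original weight matrix."""
    rows, n_groups, group_size = q.shape
    return (q.astype(np.float32) * scales).reshape(rows, n_groups * group_size)

# Usage: the round-trip error is the accuracy cost W4A8 trades for memory savings.
w = np.random.randn(256, 512).astype(np.float32)
q, s = quantize_weights_int4_groupwise(w)
print(np.abs(w - dequantize_int4_groupwise(q, s)).mean())
```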
Problem

Research questions and friction points this paper is trying to address.

Quantize OpenPangu models for efficient NPU deployment
Reduce memory and latency from Chain-of-Thought reasoning overheads
Maintain high accuracy while accelerating inference with low-bit quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-bit quantization transforms FP16 to integer arithmetic
Unified framework supports INT8 and W4A8 quantization; a minimal calibration sketch follows this list
Optimized for openPangu models on Atlas A2 NPUs
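
The AI summary mentions CoT-mode-aware calibration, but this page gives no implementation detail. The sketch below only illustrates the general idea under assumptions: run a small set of prompts from each CoT mode through the FP16 model, track per-layer activation absmax, and derive a shared symmetric INT8 activation scale per layer. `run_model` and all other names here are hypothetical.

```python
import numpy as np

def calibrate_activation_scales(run_model, prompts_by_mode, layer_names):
    """Derive one symmetric INT8 activation scale per layer that covers all CoT modes.
    `run_model(prompt, mode)` is assumed to return a dict mapping layer name to the
    activation array captured during an FP16 forward pass (hypothetical interface)."""
    absmax = {name: 0.0 for name in layer_names}
    for mode, prompts in prompts_by_mode.items():   # slow_think / auto_think / no_think
        for prompt in prompts:
            acts = run_model(prompt, mode)
            for name in layer_names:
                absmax[name] = max(absmax[name], float(np.abs(acts[name]).max()))
    return {name: m / 127.0 for name, m in absmax.items()}

# Stub usage: a fake model that returns random activations per layer.
layers = ["attn_out", "mlp_out"]
stub = lambda prompt, mode: {n: np.random.randn(8, 64) for n in layers}
prompts = {"slow_think": ["p1"], "auto_think": ["p2"], "no_think": ["p3"]}
print(calibrate_activation_scales(stub, prompts, layers))
```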
Yilun Luo
General Motors
Power Electronics, Powertrain, Control, Machine Learning, Intelligent Wearable Interfaces
HuaQing Zheng
School of Computer Science and Technology, Tianjin University
Haoqian Meng
School of Computer Science and Technology, Tianjin University
Wenyuan Liu
School of Computer Science and Technology, Tianjin University
Peng Zhang
School of Computer Science and Technology, Tianjin University