AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-performance compute kernels for Ascend NPUs with general-purpose large language models, which struggle due to scarce domain-specific training data and the stringent constraints of NPU architectures. The authors propose an integrated generation-and-evaluation framework comprising the Ascend-CoT dataset with chain-of-thought reasoning, a domain-adaptive language model named KernelGen-LM, and a multidimensional evaluation benchmark, NPUKernelBench. By combining supervised fine-tuning, execution-feedback reinforcement learning, and joint compilation, functional, and performance evaluation, the approach achieves a 95.5% compilation success rate (Pass@10) and 64.3% functional correctness on complex Level-2 kernels, substantially outperforming baseline methods and marking the first demonstration of highly reliable LLM-based automatic kernel generation for NPUs.

📝 Abstract
To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.
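The abstract reports results as Pass@10. The paper does not spell out its exact sampling protocol, but the standard unbiased pass@k estimator (from the Codex/HumanEval evaluation methodology) gives a sense of how such numbers are typically computed from n sampled generations, of which c succeed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, of which c
    are correct, is correct. Standard formulation; the paper's exact
    protocol is an assumption here."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations per kernel, 2 compile: pass@1 is the per-sample rate
print(pass_at_k(10, 2, 1))   # ~0.2
# with k == n, pass@k is 1.0 whenever at least one sample succeeds
print(pass_at_k(10, 1, 10))  # 1.0
```

With k equal to the number of samples, as in Pass@10 over 10 generations, the metric reduces to "at least one success", which is why compilation Pass@10 can be high even when per-sample rates are modest.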
Problem

Research questions and friction points this paper is trying to address.

NPU
Kernel Generation
Large Language Models
Domain-Specific Language
Code Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU kernel generation
domain-adaptive LLM
chain-of-thought reasoning
execution-guided reinforcement learning
hardware-aware code generation
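The innovation list highlights execution-guided reinforcement learning. The paper's reward design is not given here; the following is a purely hypothetical sketch of how execution feedback (compile status, test results, measured speedup) might be shaped into a scalar reward, with all thresholds and weights invented for illustration:

```python
def kernel_reward(compiled: bool, passed_tests: int, total_tests: int,
                  speedup: float) -> float:
    """Hypothetical staged reward for execution-feedback RL on generated
    NPU kernels (weights are illustrative, not from the paper):
    - zero reward if the kernel fails to compile,
    - base credit for compiling plus partial credit for correctness,
    - a capped performance bonus only once all tests pass."""
    if not compiled:
        return 0.0
    correctness = passed_tests / total_tests
    perf_bonus = 0.2 * min(speedup, 2.0) if correctness == 1.0 else 0.0
    return 0.5 + 0.3 * correctness + perf_bonus

print(kernel_reward(False, 0, 10, 0.0))   # 0.0: compile failure dominates
print(kernel_reward(True, 10, 10, 1.0))   # 1.0: compiles, correct, baseline speed
```

Gating the performance bonus on full correctness is one common way to keep the policy from trading correctness for speed; whether AscendKernelGen does this is not stated in the abstract.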
👥 Authors
Xinzi Cao (Sun Yat-sen University)
Jianyang Zhai (Pengcheng Laboratory)
Pengfei Li (Huawei)
Zhiheng Hu (Huawei)
Cen Yan (Huawei)
Bingxu Mu (Huawei)
Guanghuan Fang (Huawei)
Bin She (Huawei)
Jiayu Li (Huawei)
Yihan Su (Huawei)
Dongyang Tao (Huawei)
Xiansong Huang (Pengcheng Laboratory)
Fan Xu (Pengcheng Laboratory)
Feidiao Yang (Pengcheng Laboratory)
Yao Lu (Pengcheng Laboratory)
Chang-Dong Wang (Sun Yat-sen University)
Yutong Lu (Sun Yat-sen University)
Weicheng Xue (Pengcheng Laboratory)
Bin Zhou (Pengcheng Laboratory)
Yonghong Tian (Peking University)