CodegenBench: Can LLMs Write Efficient Code Across Architectures?

📅 2026-06-01

🤖 AI Summary

This work addresses the limited understanding of how well large language models (LLMs) generate efficient code for CPU-oriented high-performance computing and domestic heterogeneous architectures. The authors present CodegenBench, the first systematic benchmark evaluating mainstream LLMs on their ability to generate performant parallel code across three hardware platforms: x86_64, Sunway, and Kunpeng. It covers 106 BLAS routines and, per the abstract, 20 specialized computational kernels adapted for each of the Sunway and Kunpeng architectures. Through automated cross-architecture compilation, runtime profiling, and performance analysis, the study finds that LLMs can produce highly efficient code on well-documented x86_64 systems, but the quality of the generated code degrades significantly on specialized architectures with sparse documentation. The models achieve their best results on tasks of moderate complexity that require short implementations. The released dataset and evaluation framework fill a critical gap in LLM research for domestic supercomputing platforms.

📝 Abstract

While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a comprehensive benchmark suite designed to evaluate the generation of efficient parallel code across three distinct hardware platforms: x86_64, Sunway, and Kunpeng. Our benchmark comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines, which establish a fundamental baseline, alongside 20 specialized computational kernels adapted for each of the two supercomputing architectures (LeetSunway and LeetKunpeng). Our extensive evaluation reveals that while state-of-the-art LLMs can generate optimized code for ubiquitous architectures like x86_64, the code they generate performs significantly worse on specialized architectures with limited public documentation and training data, highlighting critical limitations in cross-platform generalization. Furthermore, our analysis of factors influencing code quality, such as implementation length and task complexity, indicates that current LLMs are most effective on moderately difficult problems that require concise code. We open-source our dataset and automated evaluation infrastructure to facilitate future research in LLM-driven high-performance code generation. The resources are available at https://anonymous.4open.science/r/CodegenBench-EDE1/ and https://anonymous.4open.science/r/CodegenBenchDataset-2551.
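The abstract does not spell out how generated code is scored, so the following is only a minimal sketch of the general idea: check a candidate kernel against a trusted reference, then compare timings. It is not the authors' evaluation framework. The AXPY routine is a standard BLAS level-1 operation; every function name here (`reference_axpy`, `candidate_axpy`, `is_correct`, `best_time`, `evaluate`) is hypothetical, and pure Python stands in for the compiled parallel code the benchmark targets.

```python
import math
import time
from typing import Callable, Dict, List, Optional, Sequence

# Signature shared by the reference and the candidate kernel (hypothetical).
Axpy = Callable[[float, Sequence[float], Sequence[float]], List[float]]


def reference_axpy(alpha: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Trusted baseline for BLAS level-1 AXPY: alpha * x + y, elementwise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def candidate_axpy(alpha: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Stand-in for model-generated code under evaluation."""
    out = list(y)
    for i, xi in enumerate(x):
        out[i] += alpha * xi
    return out


def is_correct(candidate: Axpy, reference: Axpy, alpha: float,
               x: Sequence[float], y: Sequence[float]) -> bool:
    """Compare candidate output to the reference within a floating-point tolerance."""
    got = candidate(alpha, x, y)
    want = reference(alpha, x, y)
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=1e-9, abs_tol=1e-12) for g, w in zip(got, want)
    )


def best_time(fn: Axpy, alpha: float, x: Sequence[float],
              y: Sequence[float], repeats: int = 5) -> float:
    """Return the fastest wall-clock time over several runs to reduce noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(alpha, x, y)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate(candidate: Axpy, reference: Axpy, n: int = 10_000) -> Dict[str, Optional[float]]:
    """Report correctness and, only for correct candidates, speedup over the reference."""
    alpha = 0.5
    x = [float(i) for i in range(n)]
    y = [float(n - i) for i in range(n)]
    ok = is_correct(candidate, reference, alpha, x, y)
    speedup = None
    if ok:
        ref_t = best_time(reference, alpha, x, y)
        cand_t = max(best_time(candidate, alpha, x, y), 1e-12)  # guard against a zero reading
        speedup = ref_t / cand_t
    return {"correct": ok, "speedup": speedup}
```

Calling `evaluate(candidate_axpy, reference_axpy)` returns a dictionary with a correctness flag and a speedup ratio. Gating the timing on correctness reflects a common design choice in performance benchmarks: a fast but wrong kernel should not receive a performance score.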

Problem

Research questions and friction points this paper is trying to address.

large language models

code generation

high-performance computing

cross-architecture generalization

CPU-oriented HPC

Innovation

Methods, ideas, or system contributions that make the work stand out.

CodegenBench

large language models

high-performance computing