Tutoring LLM into a Better CUDA Optimizer

📅 2025-10-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the capability boundaries of large language models (LLMs) in autonomously generating and optimizing CUDA code, specifically examining whether fine-grained prompt engineering—including parallelism pattern guidance and interactive error correction—can substantially improve code quality. Method: We propose an end-to-end evaluation framework integrating prompt tutoring, multi-turn dialogue-based refinement, automated performance benchmarking, and expert human code review. Contribution/Results: We introduce a structured, GPU-parallelism–aware prompting paradigm and, for the first time, incorporate interactive debugging into LLM-driven CUDA optimization. Experiments demonstrate that instruction-tuned, advanced reasoning LLMs generate CUDA kernels approaching expert-level quality: achieving 1.3–2.1× speedup on representative compute-intensive workloads and improving functional correctness by 37%. These results validate both the promise and inherent limitations of LLMs in complex, systems-level GPU programming tasks.

📝 Abstract
Recent leaps in large language models (LLMs) have caused a revolution in programming tools (such as GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the most recent reasoning models to generate optimized CUDA code for predefined, well-known tasks. Our objective is to determine which types of code optimizations and parallel patterns the LLMs can perform by themselves and whether they can be improved by tutoring (providing more detailed hints and guidelines in the prompt). The generated solutions were evaluated both automatically (for correctness and speedup) and manually (code reviews) to provide a more detailed perspective. We also tried an interactive approach in which the LLM can fix its previous mistakes within a session. The results indicate that LLMs are quite skilled coders; however, they require tutoring to reach the optimized solutions provided by parallel computing experts.
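To make the "tutoring" idea concrete, the sketch below contrasts a bare task prompt with one augmented by GPU-parallelism hints (shared-memory staging, coalesced access, grid-stride loops). This is an illustrative reconstruction, not the paper's actual prompt text; all names and hint wording are assumptions.

```python
# Hypothetical sketch: the same CUDA task is prompted either bare (baseline)
# or with fine-grained optimization guidelines appended (tutored).
TUTORING_HINTS = [
    "Use a grid-stride loop so the kernel handles inputs larger than the grid.",
    "Stage reused data in __shared__ memory and synchronize with __syncthreads().",
    "Ensure global-memory accesses within a warp are coalesced (consecutive addresses).",
]

def build_prompt(task: str, tutored: bool = False) -> str:
    """Compose an LLM prompt; optionally append parallel-pattern guidelines."""
    parts = [f"Write an optimized CUDA kernel for the following task:\n{task}"]
    if tutored:
        parts.append("Follow these optimization guidelines:")
        parts.extend(f"- {hint}" for hint in TUTORING_HINTS)
    return "\n".join(parts)

baseline = build_prompt("sum-reduce a float array of length n")
tutored = build_prompt("sum-reduce a float array of length n", tutored=True)
```

The point of this structure is that the baseline and tutored runs differ only in the appended guidelines, so any quality gap can be attributed to the tutoring itself.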
Problem

Research questions and friction points this paper is trying to address.

Optimizing CUDA code using large language models
Evaluating LLM capabilities for parallel computing optimizations
Improving code performance through interactive tutoring techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates optimized CUDA code automatically
Tutoring improves optimization via detailed prompts
Interactive session enables LLM to fix mistakes
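The interactive-session idea above can be sketched as a compile-test-repair loop: the model's kernel is built and benchmarked, and on failure the diagnostics are fed back so the model can correct its own mistake within the same session. In this sketch, `compile_and_test` and `llm_fix` are stand-ins for a real nvcc/benchmark harness and a model API; they are illustrative assumptions, not part of the paper's artifact.

```python
# Illustrative interactive-refinement loop with injectable harness and model.
def refine_kernel(initial_code, compile_and_test, llm_fix, max_turns=3):
    """Return (code, passed, turns_used) after iterative error feedback."""
    code = initial_code
    for turn in range(1, max_turns + 1):
        ok, diagnostics = compile_and_test(code)
        if ok:
            return code, True, turn
        code = llm_fix(code, diagnostics)  # model sees its own error output
    ok, _ = compile_and_test(code)
    return code, ok, max_turns

# Stub session: the first attempt fails; the "model" repairs it on feedback.
def fake_test(code):
    return ("fixed" in code, "error: misaligned shared-memory access")

def fake_fix(code, diagnostics):
    return code + "  // fixed per: " + diagnostics

code, passed, turns = refine_kernel("__global__ void k() {}", fake_test, fake_fix)
```

Capping the loop at a few turns mirrors the paper's session-based setup: either the kernel converges to a correct, faster version quickly, or the attempt is counted as a failure.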
Matyáš Brabec
Charles University, Malostranské náměstí 25, 118 00 Praha 1, Czech Republic
Jiří Klepl
Charles University, Malostranské náměstí 25, 118 00 Praha 1, Czech Republic
Michal Töpfer
Charles University
computer science, artificial intelligence, machine learning
Martin Kruliš
Charles University
GPGPU, parallel processing, high performance code optimizations, data analytics, machine learning