MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: current large language models struggle to generate correct and efficient native GPU kernel code, and execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We propose MusaCoder, a full-stack training framework for CUDA and MUSA backends that improves code generation quality through progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback-driven reinforcement learning. Three mechanisms stabilize RL training: PrimeEcho, a first-turn-anchored multi-turn reward mechanism; Buffered Dynamic Retry, which recovers training signal from all-failed hard samples; and MirrorPop, an off-policy sequence filtering method. Evaluated on KernelBench and a MUSA-ported variant of it, the 9B model matches or exceeds frontier closed-source models, while the 27B model establishes a new state of the art, outperforming both open-source and proprietary baselines in correctness and empirical speedup.

📝 Abstract

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
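To make the sparse-reward problem concrete, the following is a minimal sketch of an execution-feedback reward and of group-normalized advantages. It is not MusaCoder's or MooreEval's actual implementation: the `ExecResult` fields, the `execution_reward` shape (correctness gate plus capped speedup), the `speedup_cap` value, and the use of group-normalized advantages are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ExecResult:
    """Hypothetical verifier output for one generated kernel."""
    compiled: bool       # did the kernel build?
    correct: bool        # did outputs match the reference?
    baseline_ms: float   # runtime of the reference implementation
    kernel_ms: float     # runtime of the generated kernel


def execution_reward(r: ExecResult, speedup_cap: float = 4.0) -> float:
    """Assumed reward: 0 on any failure, else 1 plus a capped speedup."""
    if not (r.compiled and r.correct) or r.kernel_ms <= 0.0:
        return 0.0
    speedup = r.baseline_ms / r.kernel_ms
    return 1.0 + min(speedup, speedup_cap)


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's group of samples.

    When every sample fails, all rewards are equal, the spread is
    zero, and every advantage is zero: the group carries no signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0] * len(rewards)
    return [(x - mu) / sigma for x in rewards]
```

Under these assumptions, a prompt whose samples all fail yields all-zero advantages and contributes nothing to the policy update. This is the kind of all-failed hard sample the abstract says Buffered Dynamic Retry targets.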

Problem

Research questions and friction points this paper is trying to address.

native GPU kernel generation

Large Language Models

reinforcement learning

sparse rewards

training instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

native GPU kernel generation

full-stack training

execution-feedback reinforcement learning