KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of CUDA code optimization across heterogeneous GPU architectures, where the vast search space, traditional compilers' reliance on fixed heuristics, and the high cost of fine-tuning large language models (LLMs) hinder effective, cumulative optimization. To overcome these limitations, we propose KernelBlaster, a novel framework featuring memory-augmented in-context reinforcement learning (MAIC-RL). This mechanism integrates a retrievable, persistent CUDA knowledge base with performance-profile-driven textual gradients, enabling LLM agents to continually learn from historical optimization experiences and systematically explore high-performance strategies. On the three-tier KernelBench benchmark, our approach achieves geometric mean speedups of 1.43×, 2.50×, and 1.50× over the PyTorch baseline on Levels 1, 2, and 3, respectively. The complete framework and evaluation pipeline are publicly released.
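The summary's "retrievable, persistent CUDA knowledge base" can be pictured as a store of past optimization lessons queried by task similarity. The sketch below is a minimal illustration under assumed names (`KnowledgeBase`, tag-overlap scoring); the paper's actual retrieval mechanism is not specified here and is likely more sophisticated (e.g., embedding-based).

```python
# Hypothetical sketch of a persistent CUDA knowledge base: lessons from
# past optimization runs are stored as text and retrieved by simple
# tag-overlap similarity. All names and the scoring rule are assumptions.
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)  # (task_tags, lesson) pairs

    def store(self, task_tags: set, lesson: str) -> None:
        # Persist one optimization experience.
        self.entries.append((task_tags, lesson))

    def retrieve(self, task_tags: set, k: int = 2) -> list:
        # Rank stored lessons by tag overlap with the new task and
        # return the top-k that share at least one tag.
        scored = sorted(self.entries,
                        key=lambda e: len(e[0] & task_tags),
                        reverse=True)
        return [lesson for tags, lesson in scored[:k] if tags & task_tags]


kb = KnowledgeBase()
kb.store({"matmul", "shared-memory"},
         "Tile into shared memory; pad tiles to avoid bank conflicts.")
kb.store({"reduction", "warp"},
         "Use warp shuffles for the final reduction stage.")

# A new matmul-like task retrieves only the relevant lesson.
hints = kb.retrieve({"matmul", "tensor-core"})
```

In a real agentic flow, the retrieved `hints` would be injected into the LLM's context before kernel generation, which is what makes the learning "in-context" rather than weight-based.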

📝 Abstract
Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, while fine-tuning Large Language Models (LLMs) can be expensive. Existing agentic workflows for CUDA code optimization, meanwhile, have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve the CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge in a retrievable Persistent CUDA Knowledge Base. We further propose a novel profile-guided, textual-gradient-based agentic flow for CUDA generation and optimization that achieves high performance across generations of GPU architectures, guiding LLM agents to systematically explore high-potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43×, 2.50×, and 1.50× on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.
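The abstract's "profile-guided, textual-gradient-based agentic flow" can be sketched as a loop: profile the current kernel, translate the metrics into natural-language improvement directions (the "textual gradient"), and ask the LLM to revise. The code below is a stubbed illustration; `fake_profile` and `fake_llm` are placeholders I introduce for the sketch, not the paper's components, and real metrics would come from a profiler such as Nsight Compute.

```python
# Minimal sketch of a profile-guided textual-gradient loop.
# fake_profile and fake_llm are stand-ins for a GPU profiler and an
# LLM call; only the loop structure reflects the described flow.

def fake_profile(kernel: str) -> dict:
    # Stand-in for profiler output; returns fixed mock metrics.
    return {"occupancy": 0.45, "dram_throughput": 0.30}


def textual_gradient(metrics: dict) -> str:
    # Convert profiler metrics into natural-language directions
    # that an LLM can act on -- the "textual gradient".
    advice = []
    if metrics["occupancy"] < 0.5:
        advice.append("increase occupancy (reduce register pressure)")
    if metrics["dram_throughput"] < 0.6:
        advice.append("improve memory coalescing or use shared memory")
    return "; ".join(advice)


def fake_llm(kernel: str, feedback: str) -> str:
    # Stand-in for the LLM rewrite step: appends the feedback it acted on.
    return kernel + f"\n// revised per feedback: {feedback}"


def optimize(kernel: str, steps: int = 2) -> str:
    # Iterate: profile -> textual gradient -> LLM revision.
    for _ in range(steps):
        grad = textual_gradient(fake_profile(kernel))
        if not grad:
            break  # profile raises no issues; stop early
        kernel = fake_llm(kernel, grad)
    return kernel


result = optimize("__global__ void k() {}")
```

The design point this illustrates is that the "gradient" is textual rather than numeric: profiler evidence is verbalized so the LLM's revision is grounded in measured bottlenecks instead of naive rewriting.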
Problem

Research questions and friction points this paper is trying to address.

CUDA optimization
cross-task learning
GPU architectures
knowledge accumulation
LLM-based code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-Augmented In-context Reinforcement Learning
CUDA Optimization
Persistent Knowledge Base
Profile-Guided Code Generation
LLM-based GPU Coding Agent