AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Transformer inference requires semantics-aware decisions about which regions of the computation graph to specialize and which CUDA implementations to use, a task that conventional compilers struggle to automate. This work integrates a large language model (LLM) into the compilation pipeline as an advisor. The LLM produces advisory metadata (semantic labels, candidate priorities, parameter hints, and risk annotations) that guides the search. The compiler itself generates CUDA candidates from templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, this yields average speedups of 5.66×, 4.05×, and 4.26× over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively, across five representative workloads.

📝 Abstract

Transformer inference increasingly depends on specialized compiler and runtime support, but real model graphs still require semantic decisions about which regions are worth specializing and which CUDA implementation families are plausible. We present AgentCompile, an LLM-guided CUDA inference compiler that uses LLM outputs only as advisory search metadata. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes semantic labels, candidate priorities, parameter hints, and risk annotations; the compiler materializes CUDA candidates through templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, AgentCompile averages 5.66x, 4.05x, and 4.26x speedup over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively, across five representative workloads. We will open-source the project.
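The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`Candidate`, `select_implementation`, `llm_priority`, `shared_mem_limit`) is hypothetical, plain Python callables stand in for compiled CUDA kernels, and a single shared-memory budget stands in for the full set of interface and hardware constraints. The point it shows is that the LLM output only orders the search, while correctness checks and measured latency decide the result.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Candidate:
    """Hypothetical stand-in for one templated CUDA implementation."""
    name: str
    run: Callable[[Sequence[float]], List[float]]  # placeholder for a compiled kernel
    shared_mem_bytes: int                          # resource checked against a hardware limit


def measure_latency(fn: Callable, x: Sequence[float], repeats: int = 5) -> float:
    """Return the best wall-clock time of `fn(x)` over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        best = min(best, time.perf_counter() - start)
    return best


def select_implementation(
    candidates: List[Candidate],
    llm_priority: Dict[str, int],   # advisory only: lower value = try earlier
    fallback: Candidate,            # reference path, assumed correct
    x: Sequence[float],
    shared_mem_limit: int,
    tol: float = 1e-6,
    timer: Callable = measure_latency,
) -> Candidate:
    expected = fallback.run(x)
    best, best_latency = fallback, timer(fallback.run, x)

    # The LLM hint only orders the search; unranked candidates go last.
    ordered = sorted(
        candidates, key=lambda c: llm_priority.get(c.name, len(llm_priority))
    )
    for cand in ordered:
        # Constraint check (assumption: one shared-memory budget).
        if cand.shared_mem_bytes > shared_mem_limit:
            continue
        # Empirical validation against the reference output.
        try:
            out = cand.run(x)
        except Exception:
            continue
        if len(out) != len(expected) or any(
            abs(a - b) > tol for a, b in zip(out, expected)
        ):
            continue
        # Selection by measured latency; keep the fallback unless beaten.
        latency = timer(cand.run, x)
        if latency < best_latency:
            best, best_latency = cand, latency
    return best
```

In this sketch a wrong or misleading hint can cost search time but cannot change the output, because a candidate is chosen only after it passes validation and beats the fallback's measured latency. The `timer` parameter is injectable so that the selection logic can be exercised without real hardware timing.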

Problem

Research questions and friction points this paper is trying to address.

CUDA inference

compiler specialization

Transformer optimization

semantic decisions

LLM-guided compilation

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided compilation

CUDA inference

semantic specialization