🤖 AI Summary
To address the challenges of explicit data movement, asynchronous execution management, and high programming complexity in tensor computations on modern GPUs, particularly NVIDIA's Hopper architecture, this paper introduces Cypress, a task-based tensor programming model. Cypress provides a task-level abstraction with sequential semantics, coupled with a declarative memory/device mapping and fully automatic compiler scheduling. This lets the compiler coordinate offloading to asynchronous hardware units, including the Tensor Memory Accelerator (TMA) and Tensor Cores, while eliminating explicit synchronization, manual data transfers, and concurrency control from application code. Implemented atop a CUDA backend with warp-specialized kernel generation, Cypress achieves 88%–106% of cuBLAS performance on GEMM and 80%–98% of the best-known Flash Attention implementation. The model significantly improves developer productivity and hardware utilization without sacrificing performance.
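To make the separation of algorithm and mapping concrete, here is a minimal sketch, in plain Python rather than Cypress's actual syntax (which the summary does not show). The `task` decorator, the `MAPPING` dictionary, and all processor/memory names are invented for illustration; the point is only that tasks read as ordinary sequential code with no copies or synchronization, while a separate declarative mapping says where tasks run and where tensors live.

```python
# Hypothetical illustration of a task-based model with sequential
# semantics and a separate mapping. This is NOT Cypress's real API.
TASKS = {}

def task(fn):
    """Mark a function as a task: pure computation over tensors,
    free of explicit data movement and synchronization."""
    TASKS[fn.__name__] = fn
    return fn

@task
def scale(t, s):
    # Element-wise scaling of a tensor (here, a flat Python list).
    return [x * s for x in t]

@task
def add(a, b):
    # Element-wise addition of two tensors.
    return [x + y for x, y in zip(a, b)]

# The mapping is declarative and lives apart from the algorithm: it
# assigns each task a processor and each tensor a memory. All names
# below are made up for this sketch; a real compiler/runtime would
# consume such a specification to schedule and place data.
MAPPING = {
    "tasks":   {"scale": "tensor_core", "add": "cuda_core"},
    "tensors": {"A": "shared", "B": "shared", "C": "global"},
}

def program(a, b):
    # Sequential semantics: the program is straight-line code; any
    # asynchrony or pipelining is left entirely to the compiler.
    c = TASKS["scale"](a, 2)
    return TASKS["add"](c, b)
```

The design point being illustrated: because the algorithm never mentions memories or synchronization, the same task code can be retargeted by editing only the mapping.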
📝 Abstract
Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called *tasks* that operate on *tensors* and are free of communication and synchronization. Cypress programs are bound to the target machine through a *mapping* specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.
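The warp-specialized producer-consumer structure that the abstract describes can be sketched by analogy in ordinary Python. In the sketch below, one "producer" thread stands in for the TMA, asynchronously staging operand tiles into a bounded buffer (playing the role of shared-memory pipeline stages), while a "consumer" thread stands in for the Tensor Core warps that multiply staged tiles. The bounded `queue.Queue` supplies the back-pressure and hand-off that mbarrier-style synchronization provides on Hopper. This is an analogy only, not generated Cypress/CUDA code.

```python
import queue
import threading

def matmul(a, b):
    """Naive square-tile matmul: c[i][j] = sum_k a[i][k] * b[k][j]."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def producer(tiles, buf):
    # Stage each (A, B) tile pair into the pipeline. put() blocks when
    # the buffer is full, mimicking back-pressure from a fixed number
    # of shared-memory pipeline stages.
    for pair in tiles:
        buf.put(pair)
    buf.put(None)  # end-of-stream sentinel

def consumer(buf, results):
    # Drain staged tiles and perform the "Tensor Core" work on each.
    while True:
        pair = buf.get()
        if pair is None:
            break
        a, b = pair
        results.append(matmul(a, b))

def run_pipeline(tiles, stages=2):
    # stages bounds the buffer, modeling pipeline depth: the producer
    # can run at most `stages` tiles ahead of the consumer.
    buf = queue.Queue(maxsize=stages)
    results = []
    p = threading.Thread(target=producer, args=(tiles, buf))
    c = threading.Thread(target=consumer, args=(buf, results))
    p.start(); c.start()
    p.join(); c.join()
    return results
```

The point of the paper is that this producer-consumer choreography, which today must be hand-written in warp-specialized CUDA, is exactly what the Cypress compiler synthesizes from sequential task code.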