🤖 AI Summary
To address the challenges of explicit data movement, asynchronous execution management, and high programming complexity in tensor computations on modern GPUs, particularly NVIDIA's Hopper architecture, this paper introduces Cypress, a task-based tensor programming model. Cypress provides a task-level abstraction with sequential semantics, coupled with a declarative memory/device mapping and fully automatic compiler scheduling. This lets the compiler coordinate offloading to asynchronous hardware units, including the Tensor Memory Accelerator (TMA) and Tensor Cores, while eliminating explicit synchronization, manual data transfers, and concurrency control from application code. Implemented atop a CUDA backend with warp-specialized kernel generation, Cypress achieves 88%–106% of cuBLAS performance on GEMM and 80%–98% of the best-known Flash Attention implementation. The model significantly improves developer productivity and hardware utilization without sacrificing performance.
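To make the separation of algorithm and mapping concrete, here is a minimal sketch, in plain Python rather than Cypress's actual syntax (which the summary does not show). The `task` decorator, the `MAPPING` dictionary, and all processor/memory names are invented for illustration; the point is only that tasks read as ordinary sequential code with no copies or synchronization, while a separate declarative mapping says where tasks run and where tensors live.

```python
# Hypothetical illustration of a task-based model with sequential
# semantics and a separate mapping. This is NOT Cypress's real API.
TASKS = {}

def task(fn):
    """Mark a function as a task: pure computation over tensors,
    free of explicit data movement and synchronization."""
    TASKS[fn.__name__] = fn
    return fn

@task
def scale(t, s):
    # Element-wise scaling of a tensor (here, a flat Python list).
    return [x * s for x in t]

@task
def add(a, b):
    # Element-wise addition of two tensors.
    return [x + y for x, y in zip(a, b)]

# The mapping is declarative and lives apart from the algorithm: it
# assigns each task a processor and each tensor a memory. All names
# below are made up for this sketch; a real compiler/runtime would
# consume such a specification to schedule and place data.
MAPPING = {
    "tasks":   {"scale": "tensor_core", "add": "cuda_core"},
    "tensors": {"A": "shared", "B": "shared", "C": "global"},
}

def program(a, b):
    # Sequential semantics: the program is straight-line code; any
    # asynchrony or pipelining is left entirely to the compiler.
    c = TASKS["scale"](a, 2)
    return TASKS["add"](c, b)
```

The design point being illustrated: because the algorithm never mentions memories or synchronization, the same task code can be retargeted by editing only the mapping.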
📝 Abstract
Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called *tasks* that operate on *tensors* and are free of communication and synchronization. Cypress programs are bound to the target machine through a *mapping* specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.
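The warp-specialized producer-consumer structure that the abstract describes can be sketched by analogy in ordinary Python. In the sketch below, one "producer" thread stands in for the TMA, asynchronously staging operand tiles into a bounded buffer (playing the role of shared-memory pipeline stages), while a "consumer" thread stands in for the Tensor Core warps that multiply staged tiles. The bounded `queue.Queue` supplies the back-pressure and hand-off that mbarrier-style synchronization provides on Hopper. This is an analogy only, not generated Cypress/CUDA code.

```python
import queue
import threading

def matmul(a, b):
    """Naive square-tile matmul: c[i][j] = sum_k a[i][k] * b[k][j]."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def producer(tiles, buf):
    # Stage each (A, B) tile pair into the pipeline. put() blocks when
    # the buffer is full, mimicking back-pressure from a fixed number
    # of shared-memory pipeline stages.
    for pair in tiles:
        buf.put(pair)
    buf.put(None)  # end-of-stream sentinel

def consumer(buf, results):
    # Drain staged tiles and perform the "Tensor Core" work on each.
    while True:
        pair = buf.get()
        if pair is None:
            break
        a, b = pair
        results.append(matmul(a, b))

def run_pipeline(tiles, stages=2):
    # stages bounds the buffer, modeling pipeline depth: the producer
    # can run at most `stages` tiles ahead of the consumer.
    buf = queue.Queue(maxsize=stages)
    results = []
    p = threading.Thread(target=producer, args=(tiles, buf))
    c = threading.Thread(target=consumer, args=(buf, results))
    p.start(); c.start()
    p.join(); c.join()
    return results
```

The point of the paper is that this producer-consumer choreography, which today must be hand-written in warp-specialized CUDA, is exactly what the Cypress compiler synthesizes from sequential task code.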