🤖 AI Summary
Stencil computations in scientific computing exhibit irregular sparse access patterns that are inherently incompatible with GPU sparse tensor cores—such as 2:4 structured-sparse hardware—which require regular, hardware-aligned sparsity. This work pioneers the integration of sparse tensor cores into stencil computation. We propose a holistic optimization framework comprising adaptive layout deformation, structured-sparsity conversion, and automatic kernel generation. Our approach employs a *flatten-and-crush* pipeline, graph-matching–based modeling, layout search, and table-driven memory mapping to efficiently transform irregular stencil sparsity into hardware-friendly structured formats. Evaluated on 79 stencil kernels, our method achieves an average speedup of 3.1× (up to 7.1×) over baseline dense implementations, significantly reducing development complexity while matching or surpassing expert hand-tuned performance.
📝 Abstract
Sparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns. This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transformation. SparStencil introduces three key techniques: (1) Adaptive Layout Morphing, which restructures stencil patterns into staircase-aligned sparse matrices via a flatten-and-crush pipeline; (2) Structured Sparsity Conversion, which formulates the transformation as a graph matching problem to ensure compatibility with 2:4 sparsity constraints; (3) Automatic Kernel Generation, which compiles transformed stencils into optimized sparse MMA kernels via layout search and table-driven memory mapping. Evaluated on 79 stencil kernels spanning diverse scientific domains, SparStencil achieves up to 7.1× speedup (3.1× on average) over a state-of-the-art framework while reducing code complexity and matching or exceeding expert-tuned performance in both compute throughput and memory efficiency.
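To make the 2:4 structured-sparsity constraint concrete: sparse tensor cores require that, in every contiguous group of four values along a row, at most two are nonzero. The sketch below (a generic NumPy illustration, not SparStencil's actual conversion algorithm; the function name `prune_2_4` is hypothetical) enforces this pattern by keeping the two largest-magnitude entries in each group of four.

```python
import numpy as np

def prune_2_4(mat):
    """Zero out all but the 2 largest-magnitude entries in each
    contiguous group of 4 along every row (2:4 structured sparsity).
    Assumes the number of columns is a multiple of 4."""
    out = mat.copy()
    rows, cols = out.shape
    groups = out.reshape(rows, cols // 4, 4)  # view into `out`
    # Indices of the 2 smallest-magnitude entries per group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return out

a = np.array([[4., 1., 3., 2., 9., 8., 7., 6.]])
print(prune_2_4(a))
# → [[4. 0. 3. 0. 9. 8. 0. 0.]]
```

A matrix in this form can be stored compactly as the two surviving values per group plus 2-bit position metadata, which is what the hardware's sparse MMA instructions consume; SparStencil's contribution is transforming irregular stencil sparsity so that this constraint holds without discarding needed coefficients.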