SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Stencil computations in scientific computing exhibit irregular sparse access patterns that are inherently incompatible with GPU sparse tensor cores—such as 2:4 structured-sparse hardware—which require regular, hardware-aligned sparsity. This work pioneers the integration of sparse tensor cores into stencil computation. We propose a holistic optimization framework comprising adaptive layout deformation, structured-sparsity conversion, and automatic kernel generation. Our approach employs a *flatten-and-crush* pipeline, graph-matching–based modeling, layout search, and table-driven memory mapping to efficiently transform irregular stencil sparsity into hardware-friendly structured formats. Evaluated on 79 stencil kernels, our method achieves an average speedup of 3.1× (up to 7.1×) over baseline dense implementations, significantly reducing development complexity while matching or surpassing expert hand-tuned performance.

📝 Abstract
Sparse Tensor Cores offer exceptional performance gains for AI workloads by exploiting structured 2:4 sparsity. However, their potential remains untapped for core scientific workloads such as stencil computations, which exhibit irregular sparsity patterns. This paper presents SparStencil, the first system to retarget sparse TCUs for scientific stencil computations through structured sparsity transformation. SparStencil introduces three key techniques: (1) Adaptive Layout Morphing, which restructures stencil patterns into staircase-aligned sparse matrices via a flatten-and-crush pipeline; (2) Structured Sparsity Conversion, which formulates the transformation as a graph matching problem to ensure compatibility with 2:4 sparsity constraints; (3) Automatic Kernel Generation, which compiles transformed stencils into optimized sparse MMA kernels via layout search and table-driven memory mapping. Evaluated on 79 stencil kernels spanning diverse scientific domains, SparStencil achieves up to 7.1x speedup (3.1x on average) over state-of-the-art frameworks while reducing code complexity and matching or exceeding expert-tuned performance in both compute throughput and memory efficiency.
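The 2:4 sparsity constraint the abstract refers to requires that in every group of four consecutive values, at most two are nonzero. The sketch below is not SparStencil's transformation; it is a minimal illustration of the hardware constraint itself, using the common heuristic of keeping the two largest-magnitude entries per group (function name and example values are hypothetical):

```python
import numpy as np

def prune_2_4(row: np.ndarray) -> np.ndarray:
    """Zero out all but the two largest-magnitude entries in each group of 4,
    yielding a vector that satisfies the 2:4 structured-sparsity constraint."""
    assert row.size % 4 == 0, "length must be a multiple of 4"
    out = row.copy()
    for g in range(0, row.size, 4):
        group_abs = np.abs(out[g:g + 4])
        # indices of the two smallest-magnitude entries in this group -> zeroed
        drop = np.argsort(group_abs)[:2]
        out[g + drop] = 0.0
    return out

w = np.array([0.9, -0.1, 0.05, 0.4, 0.0, 0.7, -0.3, 0.2])
p = prune_2_4(w)
# Each group of four now contains exactly two nonzeros.
```

SparStencil's contribution is precisely that stencil coefficient matrices do not naturally satisfy this layout, so layout morphing and graph matching are needed to reach it without discarding values.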
Problem

Research questions and friction points this paper is trying to address.

Adapting sparse tensor cores for scientific stencil computations
Transforming irregular sparsity to structured 2:4 sparsity patterns
Optimizing performance and memory efficiency in stencil kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Layout Morphing for stencil restructuring
Structured Sparsity Conversion via graph matching
Automatic Kernel Generation with layout search
Qi Li
University of Science and Technology of China
Kun Li
Microsoft Research
Haozhi Han
Peking University
Liang Yuan
Chinese Academy of Sciences
Junshi Chen
University of Science and Technology of China
Yunquan Zhang
Professor, Institute of Computing Technology, CAS
parallel computing · parallel programming · parallel computational model
Yifeng Chen
Peking University
Hong An
University of Science and Technology of China
Ting Cao
Tsinghua University
Mao Yang
Microsoft Research