Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

📅 2026-06-03

🤖 AI Summary

This work addresses the difficulty of using GPU Tensor Cores for graph algorithms, which are hindered by irregular execution patterns, load imbalance, and synchronization overhead. The authors present BLEST, a framework that reformulates breadth-first search (BFS) as a bit-level sparse matrix-vector multiplication and introduces the Binarized Virtual Slice Set (BVSS) graph representation. BLEST combines lazy vertex updates, dynamic switching from Tensor Cores to CUDA cores, and graph reordering: a scalable method that improves compression for scale-free-like graphs, and Reverse Cuthill–McKee (RCM) to improve locality for the rest. An optimized Tensor Core layout reduces the number of MMA calls by 8× compared with prior layouts. Evaluated on a broad set of real-world graphs, BLEST achieves average speedups of 22.0×, 7.7×, 8.1×, and 5.9× over GAP, Gunrock, GSWITCH, and BerryBees, respectively, and computes exact closeness centrality for a social network with 65.6M vertices and 3.6B edges in an hour using 100 H100 GPUs.

📝 Abstract

Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that limit prior approaches. BLEST introduces Binarized Virtual Slice Sets (BVSS), a graph representation that partitions work into balanced warp-level units and schedules only frontier-relevant regions of the graph. It further employs an optimized TC layout that maps neighbour checks onto binary MMA instructions without wasted outputs, reducing the number of required MMA calls by 8× compared with prior layouts. To mitigate atomic and cache bottlenecks, BLEST incorporates a lazy vertex-update scheme. We revisit the switching terminology for BFS and propose a mechanism that dynamically transitions from TCs to CUDA cores when it becomes more efficient. We also extend BLEST to multi-source BFS and closeness centrality workloads. Finally, we introduce a scalable graph reordering method that improves compression for scale-free-like graphs, while using RCM to improve locality for others. Across a broad set of real-world graphs, BLEST achieves average speedups of 22.0×, 7.7×, 8.1×, and 5.9× over GAP, Gunrock, GSWITCH, and BerryBees, respectively, establishing a new BFS baseline on GPUs. Thanks to its high performance, BLEST can compute the exact closeness centralities of 65.6M vertices in a social network with 3.6B edges in an hour using 100 H100 GPUs.
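
To make the "bit-level sparse matrix-vector computation" in the abstract concrete, here is a minimal CPU-side sketch in Python. It is not BLEST's implementation: it ignores BVSS, Tensor Cores, lazy updates, and switching. It only shows the underlying idea, in which the frontier is a bit vector and a vertex joins the next frontier when the bitwise AND of its neighbour row with the frontier is nonzero. The function name `bfs_levels_bitwise` and the edge-list input are assumptions made for this illustration.

```python
def bfs_levels_bitwise(num_vertices, edges, source):
    """Level-synchronous BFS where each step is a Boolean matrix-vector product.

    `edges` is a list of directed (u, v) pairs; list both directions for an
    undirected graph. Returns a list of BFS levels, with -1 for unreachable.
    """
    # in_nbrs[v] is a bitmask row: bit u is set iff there is an edge u -> v.
    in_nbrs = [0] * num_vertices
    for u, v in edges:
        in_nbrs[v] |= 1 << u

    levels = [-1] * num_vertices
    levels[source] = 0
    frontier = 1 << source   # bit vector of the current frontier
    visited = frontier       # bit vector of all vertices reached so far
    depth = 0

    while frontier:
        depth += 1
        next_frontier = 0
        for v in range(num_vertices):
            if (visited >> v) & 1:
                continue  # already has a level; skip the neighbour check
            # Bit-level "dot product": AND the row with the frontier vector.
            # Any surviving bit means some frontier vertex is a neighbour of v.
            if in_nbrs[v] & frontier:
                next_frontier |= 1 << v
                levels[v] = depth
        visited |= next_frontier
        frontier = next_frontier
    return levels


# Undirected path 0-1-2-3 with a branch 1-4; vertex 5 is isolated.
example_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (1, 4), (4, 1)]
print(bfs_levels_bitwise(6, example_edges, 0))  # [0, 1, 2, 3, 2, -1]
```

In this sketch the inner `in_nbrs[v] & frontier` test plays the role that the abstract assigns to binary MMA instructions, which evaluate many such neighbour checks at once on fixed-size bit tiles.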

Problem

Research questions and friction points this paper is trying to address.

Graph Traversal

Tensor Cores

Breadth-First Search

GPU Acceleration

Irregular Computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Cores

Breadth-First Search

Sparse Matrix-Vector Multiplication