Parallel GPU-Enabled Algorithms for SpGEMM on Arbitrary Semirings with Hybrid Communication

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational demands of large-scale sparse general matrix-matrix multiplication (SpGEMM) on distributed heterogeneous supercomputers, this paper introduces the first GPU-accelerated distributed SpGEMM framework supporting arbitrary semiring algebras. To alleviate communication bottlenecks, it proposes a message-size-aware hybrid communication mechanism that adaptively switches at runtime between host-to-host and device-to-device communication paths. The paper also presents the first full GPU port of CombBLAS, integrating doubly compressed sparse column (DCSC) storage with semiring-abstracted computation. Experimental evaluation shows more than a 2× speedup over the CPU-based CombBLAS implementation and up to 3× over PETSc, significantly reducing end-to-end execution time. The approach targets high-performance computing applications such as genomics and graph analytics.
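The DCSC format mentioned above compresses both dimensions of a hypersparse matrix: unlike CSC, it stores pointers only for nonempty columns. The following is an illustrative sketch, not the paper's CombBLAS code; the array names (JC, CP, IR, NUM) follow common DCSC conventions, and the converter itself is a hypothetical helper for clarity.

```python
# Illustrative sketch of doubly compressed sparse column (DCSC) storage.
# Array names follow common DCSC conventions; details may differ from the
# paper's GPU implementation.

def to_dcsc(entries):
    """Convert a {(row, col): value} dict to DCSC arrays.

    JC  -- indices of the nonempty columns
    CP  -- CP[i]..CP[i+1] delimits column JC[i]'s entries in IR/NUM
    IR  -- row indices of the nonzeros
    NUM -- nonzero values
    """
    cols = {}
    for (i, j), v in entries.items():
        cols.setdefault(j, []).append((i, v))
    JC, CP, IR, NUM = [], [0], [], []
    for j in sorted(cols):
        JC.append(j)
        for i, v in sorted(cols[j]):
            IR.append(i)
            NUM.append(v)
        CP.append(len(IR))
    return JC, CP, IR, NUM

# A 4x1000 matrix with nonzeros in only two columns: CSC would store 1001
# column pointers, DCSC stores pointers for just the 2 nonempty columns.
print(to_dcsc({(0, 3): 1.0, (2, 3): 2.0, (1, 700): 3.0}))
# -> ([3, 700], [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```

The savings matter for distributed SpGEMM because 2D partitioning leaves each process with a local submatrix whose columns are mostly empty.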

📝 Abstract
Sparse General Matrix Multiply (SpGEMM) is key for various High-Performance Computing (HPC) applications such as genomics and graph analytics. Using the semiring abstraction, many algorithms can be formulated as SpGEMM, allowing redefinition of addition, multiplication, and numeric types. Today, large input matrices require distributed memory parallelism to avoid disk I/O, and modern HPC machines with GPUs can greatly accelerate linear algebra computation. In this paper, we implement a GPU-based distributed-memory SpGEMM routine on top of the CombBLAS library. Our implementation achieves a speedup of over 2x compared to the CPU-only CombBLAS implementation and up to 3x compared to PETSc for large input matrices. Furthermore, we note that communication between processes can be optimized by either direct host-to-host or device-to-device communication, depending on the message size. To exploit this, we introduce a hybrid communication scheme that dynamically switches data paths depending on the message size, thus improving runtimes in communication-bound scenarios.
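The semiring abstraction the abstract describes can be made concrete with a small sketch (not the paper's implementation): the same SpGEMM kernel computes ordinary numeric products or, say, one relaxation step of all-pairs shortest paths, depending only on the `add`/`mul` operators and the additive identity supplied.

```python
# Illustrative semiring-parameterized SpGEMM over dict-of-coordinates
# sparse matrices. Hypothetical helper, for exposition only.

def spgemm(A, B, add, mul, zero):
    """Multiply sparse matrices stored as {(row, col): value} dicts
    over an arbitrary semiring (add, mul, additive identity zero)."""
    # Group B's entries by row for fast lookup of matching inner indices.
    B_rows = {}
    for (k, j), v in B.items():
        B_rows.setdefault(k, []).append((j, v))
    C = {}
    for (i, k), a in A.items():
        for j, b in B_rows.get(k, []):
            prod = mul(a, b)
            C[(i, j)] = add(C[(i, j)], prod) if (i, j) in C else prod
    # Drop entries equal to the additive identity to stay sparse.
    return {ij: v for ij, v in C.items() if v != zero}

A = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0}

# Standard (+, *) semiring: ordinary sparse matrix product.
print(spgemm(A, B, add=lambda x, y: x + y, mul=lambda x, y: x * y, zero=0.0))

# Tropical (min, +) semiring: path-length relaxation for shortest paths.
print(spgemm(A, B, add=min, mul=lambda x, y: x + y, zero=float("inf")))
```

Redefining the numeric type the same way (e.g. boolean AND/OR for reachability) is what lets one distributed kernel serve both genomics and graph-analytics workloads.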
Problem

Research questions and friction points this paper is trying to address.

Develop GPU-based distributed SpGEMM for large matrices
Optimize hybrid communication for varying message sizes
Accelerate semiring-based algorithms in HPC applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-based distributed-memory SpGEMM implementation
Hybrid communication scheme for optimized data transfer
Dynamic switching between host-host and device-device paths
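The switching idea in the last bullet can be sketched as a per-message policy function. The threshold value and names below are hypothetical, a minimal illustration of the trade-off rather than the paper's actual runtime criterion.

```python
# Illustrative message-size-aware path selection. The 1 MiB threshold is
# a made-up placeholder; a real system would tune it per machine.

def choose_path(message_bytes, threshold=1 << 20):
    """Pick a communication path for one message.

    Small messages: staging through host memory (host-to-host MPI) can beat
    the fixed setup cost of a device-to-device transfer.
    Large messages: GPU-aware device-to-device transfers amortize that cost
    and exploit the higher interconnect bandwidth.
    """
    return "host-to-host" if message_bytes < threshold else "device-to-device"

print(choose_path(4 * 1024))          # -> host-to-host
print(choose_path(64 * 1024 * 1024))  # -> device-to-device
```

Because SpGEMM message sizes vary widely across iterations and process pairs, deciding per message (rather than fixing one path globally) is what makes the scheme effective in communication-bound runs.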