Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the underutilization of CPU resources in modern heterogeneous high-performance computing (HPC) systems when large symmetric positive-definite linear systems are solved with GPU-only approaches. Using the SYCL programming model, the authors present heterogeneous implementations of the conjugate gradient (CG) method and the Cholesky decomposition that use the CPU and GPU simultaneously and run on multi-vendor platforms with NVIDIA, AMD, and Intel GPUs. For large matrices, the heterogeneous CG solver is up to 32% faster than the GPU-only implementation, and the heterogeneous Cholesky decomposition is up to 29% faster. Across several systems with GPUs from the three vendors, the heterogeneous Cholesky decomposition can achieve at least 12% faster runtimes for large matrices.

📝 Abstract

Many important real-world applications, such as System Identification with Gaussian Processes, involve solving linear systems with symmetric positive-definite matrices. The iterative conjugate gradient (CG) method and direct solvers based on the Cholesky decomposition are two popular methods that can be applied in this case. Because such real-world scenarios often require solving very large systems, GPUs are commonly used to accelerate the computations. However, homogeneous approaches that only leverage the GPU do not take full advantage of the often powerful CPUs found in modern HPC systems. In this work, we present multi-vendor, heterogeneous implementations of the CG method and the Cholesky decomposition that leverage the CPU and GPU of a heterogeneous system simultaneously using SYCL. Furthermore, we compare their runtime behavior to traditional, homogeneous approaches. The results show that for large matrices, our heterogeneous implementation is up to 32 percent faster for the CG method and up to 29 percent faster for the Cholesky decomposition compared to the corresponding GPU-only implementations. In addition, for large matrices, our heterogeneous implementation of the Cholesky decomposition can achieve at least 12 percent faster runtimes across several systems with GPUs from NVIDIA, AMD, and Intel.
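To make the two solver families concrete, here is a minimal sketch in plain Python of an iterative CG solve and a direct Cholesky solve on a small symmetric positive-definite system. It is not the authors' SYCL code. The row split in `split_matvec` and its `cpu_fraction` parameter are hypothetical stand-ins for dividing the matrix-vector product between a CPU and a GPU; the abstract does not describe the actual partitioning scheme, and both "devices" here run sequentially on the host.

```python
import math

def matvec_rows(A, x, rows):
    """Multiply only the given rows of A with x (one device's share of the work)."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in rows]

def split_matvec(A, x, cpu_fraction=0.25):
    """Hypothetical row split: the first block stands in for the CPU share,
    the remainder for the GPU share. Both run sequentially in this sketch."""
    n = len(A)
    cut = int(n * cpu_fraction)
    cpu_part = matvec_rows(A, x, range(0, cut))
    gpu_part = matvec_rows(A, x, range(cut, n))
    return cpu_part + gpu_part

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        Ap = split_matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def cholesky(A):
    """Return lower-triangular L with A = L L^T (column by column)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def cholesky_solve(A, b):
    """Solve A x = b via forward substitution (L y = b) and back substitution (L^T x = y)."""
    L = cholesky(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Small symmetric positive-definite test system (illustrative values only).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0, 4.0]
x_cg = conjugate_gradient(A, b)
x_chol = cholesky_solve(A, b)
```

Both routines should agree on the solution up to the CG tolerance. In a real heterogeneous implementation the two row blocks would be processed concurrently on different devices, and the split fraction would presumably be tuned to the relative throughput of the CPU and GPU.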

Problem

Research questions and friction points this paper is trying to address.

heterogeneous computing

linear systems

symmetric positive-definite matrices

GPU acceleration

CPU-GPU collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous computing

SYCL

conjugate gradient