A Parallel and Highly-Portable HPC Poisson Solver: Preconditioned Bi-CGSTAB with alpaka

📅 2025-03-11
🤖 AI Summary
To address the challenge of simultaneously achieving high performance and portability for Poisson equation solvers on heterogeneous HPC hardware, this paper designs and implements a parallel Poisson solver based on a preconditioned Bi-CGSTAB algorithm. Methodologically, it integrates MPI-based distributed computing with the alpaka cross-platform heterogeneous programming framework, introducing a novel communication-free Chebyshev preconditioner combined with Block Jacobi splitting for efficient parallelism. Key contributions include: (1) the preconditioner accelerates convergence significantly, yielding more than a 6x overall speedup; (2) hardware-agnostic high performance, with less than 5% performance variation across diverse GPU architectures, including NVIDIA H100 and AMD MI250X; and (3) GPU execution delivers up to a 50x speedup over the CPU implementation, while strong scaling up to 64 GPUs sustains more than 90% efficiency. The design's effectiveness is rigorously validated via deep performance profiling using the AMD Omnitrace tool.

📝 Abstract
This paper presents the design, implementation, and performance analysis of a parallel, GPU-accelerated Poisson solver based on the preconditioned Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) method. The implementation uses the MPI standard for distributed-memory parallelism, while on-node computation is handled by the alpaka framework, which provides both shared-memory parallelism and inherent performance portability across hardware architectures. We evaluate the solver's performance on CPUs and GPUs (NVIDIA Hopper H100 and AMD MI250X), comparing different preconditioning strategies, including Block Jacobi and Chebyshev iteration, and analyzing performance at both the single-node and multi-node level. Execution efficiency is characterized with a strong scaling test and with the AMD Omnitrace profiling tool. Our results indicate that a communication-free preconditioner based on the Chebyshev iteration can speed up the solver by more than six times. The solver shows comparable performance across different GPU architectures, achieving a computational speed-up of up to 50 times over the CPU implementation, and a strong scaling efficiency greater than 90% up to 64 devices.
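For readers unfamiliar with the method, the core Bi-CGSTAB loop (van der Vorst's formulation) can be sketched in a few lines. The pure-Python sketch below solves the 1D Poisson system with matrix A = tridiag(-1, 2, -1); it is a minimal, unpreconditioned, single-process illustration under our own naming, not the paper's MPI/alpaka implementation.

```python
def matvec_poisson1d(x):
    """y = A x for the 1D Poisson matrix A = tridiag(-1, 2, -1)."""
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(1, n):
        y[i] -= x[i - 1]      # subdiagonal contribution
        y[i - 1] -= x[i]      # superdiagonal contribution
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bicgstab(matvec, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned Bi-CGSTAB for A x = b, starting from x0 = 0."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                  # initial residual r0 = b - A x0 = b
    r_hat = r[:]              # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 < tol:        # early exit: s already small
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        t = matvec(s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if dot(r, r) ** 0.5 < tol:
            break
    return x
```

In a distributed-memory setting each `dot` becomes a global reduction (e.g. `MPI_Allreduce`), which is precisely the communication the paper's Chebyshev preconditioner is designed to avoid in the preconditioning step.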
Problem

Research questions and friction points this paper is trying to address.

Design and implement a parallel GPU-accelerated Poisson solver.
Evaluate performance across CPUs and GPUs with different preconditioners.
Achieve high performance portability and strong scaling efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel GPU-accelerated Poisson solver using Bi-CGSTAB
MPI and alpaka for distributed and shared-memory parallelism
Communication-free Chebyshev iteration preconditioner speeds up the solver more than sixfold
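The appeal of Chebyshev iteration as a preconditioner is that, given bounds [λmin, λmax] on the spectrum, it approximates A⁻¹r using only matrix-vector products and scalar recurrences: there are no inner products, hence no global reductions across MPI ranks. A minimal single-process sketch (our naming and structure, not the paper's code; it assumes lam_min < lam_max):

```python
def chebyshev_precond(matvec, r, lam_min, lam_max, steps=4):
    """Approximate z ~= A^-1 r with a fixed number of Chebyshev
    iteration steps. Uses only matvec and the eigenvalue bounds:
    no dot products, so no global MPI reductions are required."""
    theta = 0.5 * (lam_max + lam_min)   # center of the spectrum
    delta = 0.5 * (lam_max - lam_min)   # half-width of the spectrum
    sigma = theta / delta
    rho = 1.0 / sigma
    d = [ri / theta for ri in r]        # first update direction
    z = [0.0] * len(r)
    res = r[:]                          # residual of A z = r
    for _ in range(steps):
        z = [zi + di for zi, di in zip(z, d)]
        res = [ri - adi for ri, adi in zip(res, matvec(d))]
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = [rho_new * rho * di + (2.0 * rho_new / delta) * ri
             for di, ri in zip(d, res)]
        rho = rho_new
    return z
```

In the paper's setting, a step like this replaces the explicit application of M⁻¹ inside the preconditioned Bi-CGSTAB loop, with the spectral bounds assumed known or estimated in advance; the Block Jacobi splitting mentioned above confines the matrix-vector products to local data.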
Luca Pennati
KTH Royal Institute of Technology
High-Performance Computing · Plasma Physics
Maans I. Andersson
KTH Royal Institute of Technology, Stockholm, Sweden
K. Steiniger
Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
R. Widera
Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
Tapish Narwal
Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany
Michael Bussmann
Center for Advanced Systems Understanding
matter under extreme conditions · accelerator physics · high performance computing · artificial intelligence · medical physics
Stefano Markidis
Professor, KTH Royal Institute of Technology
High Performance Computing · Computational Plasma Physics · Quantum Computing