🤖 AI Summary
To address the challenge of achieving both high performance and portability for Poisson-equation solvers on heterogeneous HPC hardware, this paper presents the design and implementation of a parallel Poisson solver based on a preconditioned Bi-CGSTAB algorithm. Methodologically, it combines MPI-based distributed computing with the alpaka cross-platform heterogeneous programming framework, introducing a novel communication-free Chebyshev preconditioner combined with Block Jacobi splitting for efficient parallelism. Key contributions include: (1) the preconditioner significantly accelerates convergence, yielding an overall speedup of more than 6×; (2) high performance is achieved in a hardware-agnostic way, with less than 5% performance variation across diverse GPU architectures, including the NVIDIA H100 and AMD MI250X; and (3) single-GPU nodes deliver up to 50× speedup over CPU-only execution, while strong scaling across 64 GPUs retains more than 90% efficiency. The design's effectiveness is validated through detailed performance profiling with Omnitrace.
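As context for the summary above, the core iteration is the well-known Bi-CGSTAB method of van der Vorst. The paper's implementation is distributed (MPI) and GPU-accelerated via alpaka; the following is only a minimal single-process NumPy sketch of right-preconditioned Bi-CGSTAB, where `M_inv` is an assumed stand-in for whatever preconditioner application the solver uses (e.g. the Chebyshev/Block Jacobi scheme described here). Breakdown handling and restarts are omitted.

```python
import numpy as np

def bicgstab(A, b, M_inv=lambda v: v, tol=1e-8, max_iter=200):
    """Right-preconditioned Bi-CGSTAB sketch (van der Vorst's scheme).

    `M_inv` applies an approximate inverse of A (here: identity by
    default); breakdown checks are omitted for brevity."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()                    # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(max_iter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        p_hat = M_inv(p)                # preconditioner application
        v = A @ p_hat
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        s_hat = M_inv(s)                # second preconditioner application
        t = A @ s_hat
        omega = (t @ s) / (t @ t)       # stabilization parameter
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x
```

Note the two inner products per half-step (`r_hat @ r`, `r_hat @ v`, `t @ s`, `t @ t`): in a distributed setting each one is a global reduction, which is why a preconditioner that avoids inner products entirely is attractive.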
📝 Abstract
This paper presents the design, implementation, and performance analysis of a parallel, GPU-accelerated Poisson solver based on the preconditioned Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) method. The implementation uses the MPI standard for distributed-memory parallelism, while on-node computation is handled by the alpaka framework, which provides both shared-memory parallelism and inherent performance portability across different hardware architectures. We evaluate the solver's performance on CPUs and GPUs (NVIDIA Hopper H100 and AMD MI250X), comparing different preconditioning strategies, including Block Jacobi and Chebyshev iteration, and analyzing performance at both the single-node and multi-node level. Execution efficiency is characterized with a strong-scaling test and with the AMD Omnitrace profiling tool. Our results indicate that a communication-free preconditioner based on the Chebyshev iteration can speed up the solver by more than six times. The solver shows comparable performance across different GPU architectures, achieving a computational speed-up of up to 50 times over the CPU implementation. In addition, it shows a strong-scaling efficiency greater than 90% on up to 64 devices.
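The "communication-free" property of the Chebyshev preconditioner mentioned in the abstract follows from the structure of the iteration: unlike Krylov-type inner solves, Chebyshev iteration needs only matrix-vector products and precomputed bounds on the spectrum, so it involves no inner products and hence no global reductions. The sketch below is an illustrative NumPy version of the textbook Chebyshev iteration applied as an approximate inverse (the paper's actual implementation, with Block Jacobi splitting over MPI ranks and alpaka kernels, is not reproduced here); `lam_min` and `lam_max` are assumed eigenvalue bounds for the operator.

```python
import numpy as np

def chebyshev_precond(A, r, lam_min, lam_max, m=16):
    """Approximate z ~= A^{-1} r with m Chebyshev iterations.

    Requires only matrix-vector products and the eigenvalue bounds
    [lam_min, lam_max]: no inner products, hence no global reductions
    when run on a distributed operator."""
    theta = 0.5 * (lam_max + lam_min)   # center of the spectrum
    delta = 0.5 * (lam_max - lam_min)   # half-width of the spectrum
    sigma = theta / delta
    rho = 1.0 / sigma
    z = r / theta                       # first iterate, starting from z0 = 0
    d = z.copy()
    for _ in range(m - 1):
        rho_new = 1.0 / (2.0 * sigma - rho)
        resid = r - A @ z               # current residual (one matvec)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * resid
        z = z + d
        rho = rho_new
    return z
```

In a Block Jacobi setting, `A` would be replaced by each rank's local diagonal block, so even the matrix-vector products require no halo exchange; the trade-off is that the spectral bounds must be estimated in advance.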