🤖 AI Summary
This study addresses the challenge of performance portability in computational fluid dynamics (CFD) on heterogeneous supercomputing architectures. Focusing on SOD2D, a spectral element CFD framework studied in the context of the REFMAP project, it presents a systematic evaluation of performance portability across AMD and NVIDIA GPU platforms, motivated by urban airflow prediction. A full-stack analysis spanning the application, software, and hardware layers is conducted, using vendor-specific compiler stacks for single-GPU characterization and the LUMI multi-GPU cluster for scalability experiments. Results reveal significant performance disparities: memory access optimizations on a single GPU change performance by factors ranging from 0.69× (a slowdown) to 3.91×, and multi-GPU throughput shows similar variability. These findings highlight the limits of performance projections and the need for informed, multi-level tuning.
📝 Abstract
As heterogeneous supercomputing architectures leveraging GPUs become increasingly central to high-performance computing (HPC), it is crucial for computational fluid dynamics (CFD) simulations, a de facto HPC workload, to utilize such hardware efficiently. One of the key challenges for HPC codes is performance portability, i.e., the ability to maintain near-optimal performance across different accelerators. In the context of the \textbf{REFMAP} project, which targets scalable, GPU-enabled multi-fidelity CFD for urban airflow prediction, this paper analyzes the performance portability of SOD2D, a state-of-the-art spectral element simulation framework, across AMD and NVIDIA GPU architectures. We first discuss the physical and numerical models underlying SOD2D, highlighting its computational hotspots. Then, we examine its performance and scalability in a multi-level manner, i.e., by defining and characterizing an extensive full-stack design space spanning application-, software-, and hardware-infrastructure-related parameters. Single-GPU performance characterization across server-grade NVIDIA and AMD GPU architectures and vendor-specific compiler stacks shows the potential as well as the diverse effects of memory access optimizations, with speedups ranging from 0.69$\times$ to 3.91$\times$. The performance variability of SOD2D at scale is further examined on the LUMI multi-GPU cluster, where profiling reveals similar throughput variations, highlighting the limits of performance projections and the need for multi-level, informed tuning.