Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks

📅 2024-11-27
🏛️ 2024 IEEE 42nd Central America and Panama Convention (CONCAPAN XLII)
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance portability challenges for scientific computing applications on heterogeneous architectures, this work systematically compares four portable programming models (Kokkos, OpenMP, RAJA, and OCCA) on a unified hardware platform (a single node with four NVIDIA A100 GPUs), evaluating end-to-end performance on two representative miniapps: an N-body simulation and a structured-grid computation. The methodology unifies the multi-GPU implementations across frameworks, combining just-in-time compilation, distributed-memory parallelism, and efficient data synchronization. Results show that OCCA achieves superior performance on small-scale problems, while OpenMP exhibits significant bottlenecks on structured grids; substantial performance disparities exist among all frameworks. Critical bottlenecks are identified in inter-GPU communication overhead and reduction-algorithm efficiency. The study establishes an empirical benchmark and a methodological foundation for framework selection and co-optimization in heterogeneous high-performance scientific computing.

📝 Abstract
Scientific computing in the exascale era demands increased computational power to solve complex problems across various domains. With the rise of heterogeneous computing architectures, the need for vendor-agnostic performance portability frameworks has been highlighted. Libraries like Kokkos have become essential for enabling high-performance computing applications to execute efficiently across different hardware platforms with minimal code changes. In this direction, this paper presents preliminary time-to-solution results for two representative scientific computing applications: an N-body simulation and a structured grid simulation. Both applications used a distributed memory approach and hardware acceleration through four performance portability frameworks: Kokkos, OpenMP, RAJA, and OCCA. Experiments conducted on a single node of the Polaris supercomputer using four NVIDIA A100 GPUs revealed significant performance variability among frameworks. OCCA demonstrated faster execution times for small-scale validation problems, likely due to JIT compilation; however, its lack of optimized reduction algorithms may limit scalability for larger simulations when using its out-of-the-box API. OpenMP performed poorly in the structured grid simulation, most likely due to inefficiencies in inter-node data synchronization and communication. These findings highlight the need for further optimization to maximize each framework's capabilities. Future work will focus on enhancing reduction algorithms, data communication, and memory management, as well as performing scalability studies and a comprehensive statistical analysis to evaluate and compare framework performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating performance portability frameworks for multi-GPU scientific computing applications
Comparing execution efficiency across the Kokkos, OpenMP, RAJA, and OCCA frameworks
Identifying optimization needs in reduction algorithms and data communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-GPU implementation across four portability frameworks
Distributed memory approach with hardware acceleration
Performance comparison using N-body and structured grid simulations
Johansell Villalobos
National High Technology Center, San José, Costa Rica
Josef Ruzicka
National High Technology Center and Costa Rica Institute of Technology, San José, Costa Rica
Silvio Rizzi
Argonne National Laboratory
Scientific Visualization · In Situ Vis and Analysis · High Performance Computing · Virtual Reality