🤖 AI Summary
This work addresses the lack of a unified, system-level understanding of NVSHMEM’s design and behavior, which is currently scattered across documentation, source code, and application experience. It presents a concise study of NVSHMEM’s programming model, implementation, and performance characteristics, focusing on symmetric memory, GPU-initiated one-sided operations, and device-side collectives, and it examines DeepEP as a case study of NVSHMEM in performance-critical sparse deep learning workloads. The analysis shows that NVSHMEM pioneered a device-side symmetric-memory programming model that enables fine-grained, GPU-driven communication and is important for approaching the hardware performance limit. The work defines NVSHMEM’s role as a systems building block, highlights its design trade-offs, and identifies opportunities for improving GPU communication runtimes.
📝 Abstract
NVSHMEM is NVIDIA's OpenSHMEM-based PGAS communication library for GPU clusters, enabling GPU-initiated, one-sided communication through symmetric memory. Despite its growing adoption, a system-level understanding of its design and behavior remains scattered across documentation, source code, and application experience. This paper presents a concise study of NVSHMEM's programming model, implementation, and performance characteristics, focusing on symmetric memory, one-sided operations, and device-side collectives. We also examine DeepEP as a case study of NVSHMEM in performance-critical sparse deep learning workloads. Our analysis shows that NVSHMEM pioneered a device-side symmetric-memory programming model that enables fine-grained GPU-driven communication and is important for approaching the hardware performance limit. Overall, this work defines NVSHMEM's role as a systems building block, highlights its design tradeoffs, and identifies opportunities for improving GPU communication runtimes.