AI Summary
Existing one-sided communication frameworks, such as MPI RMA and OpenSHMEM, face scalability and usability limits in high-performance computing because their rigid shared memory models assume symmetric memory regions or require blocking collective operations for window creation. To address these limitations, this work proposes RAMC (Remote Access Memory Channels), a lightweight explicit one-sided communication library built on persistent uni-directional communication channels. RAMC leverages the hardware capabilities of HPE Cray Slingshot interconnects, using Slingshot's memory region counters for efficient completion notification, and preserves the dynamic flexibility of RDMA while avoiding the constraints of traditional shared memory models. RAMC is implemented atop the libfabric interface; a RAMC-based heat diffusion code scales to 19,600 processes across 250 nodes. In microbenchmarks, RAMC improves bandwidth over Cray MPI by approximately 100%-130% for 1 B to 4 KiB messages under libfabric 1.15.2 and by approximately 30%-45% under libfabric 2.3.1, while small-message latency remains an area for improvement.
Abstract
In this paper, we present Remote Access Memory Channels (RAMC), an explicit one-sided communication library designed to leverage the capabilities of HPE Cray Slingshot network hardware. Existing one-sided communication frameworks, such as MPI RMA and OpenSHMEM, rely on monolithic shared memory models that introduce scalability and usability challenges. These frameworks often assume symmetric memory regions or require blocking collective operations for window creation, which can be a poor match for an application's communication needs and can hinder performance. Implicit models, such as PGAS and UPC, aim to simplify programming by treating local and remote memory as a unified region but ultimately rely on explicit mechanisms to implement data movement. MPI's recently introduced partitioned communication API offers a persistent point-to-point interface but sacrifices the dynamic flexibility of RDMA. RAMC is designed to address these limitations. Based on the core concept of a persistent uni-directional communication channel, RAMC leverages Slingshot's unique memory region counters to enable efficient completion notification. Experiments with a RAMC-based heat diffusion code demonstrate that RAMC scales to 19,600 processes across 250 nodes. Microbenchmark studies across multiple libfabric versions show that RAMC can outperform Cray's proprietary MPI implementation (e.g., bandwidth increases of approximately 100%-130% for 1 B to 4 KiB messages under libfabric 1.15.2, and of approximately 30%-45% under libfabric 2.3.1), while also identifying areas for improvement, such as small-message latency.