🤖 AI Summary
In large-scale parallel iterative stencil computations, communication overhead dominates performance. This paper systematically investigates the combined optimization of persistent and partitioned MPI for stencil communication. We propose a unified optimization framework integrating non-blocking, persistent, and partitioned communication primitives, and, for the first time, quantitatively characterize the impact of process count, thread count, and message size on partitioned communication performance. Using the Comb benchmark, we conduct a multi-scale empirical evaluation: persistent MPI achieves up to 37% speedup, partitioned MPI up to 68%, and their combination further alleviates synchronization bottlenecks, significantly improving communication efficiency. Our work establishes a reproducible methodology and empirically grounded guidelines for communication optimization in stencil-based applications.
📝 Abstract
Many parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overhead and improve communication efficiency by amortizing setup costs and reducing synchronization among threaded sends. This paper presents the performance of stencil communication in the Comb benchmarking suite when using non-blocking, persistent, and partitioned communication routines. The impact of each optimization is analyzed at various scales. Further, the paper presents an analysis of the impact of process count, thread count, and message size on partitioned communication routines. Measured timings show that persistent MPI communication can provide a speedup of up to 37% over the baseline MPI communication, and partitioned MPI communication can provide a speedup of up to 68%.