Optimizing Allreduce Operations for Heterogeneous Architectures with Multiple Processes per GPU

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the AllReduce communication bottleneck among GPUs and the underutilization of CPU cores in heterogeneous architectures, this paper proposes a lane-aware reduction optimization that orchestrates multiple CPU cores alongside each GPU. Specifically, it extends lane-aware reductions to the GPU side for the first time, binding and scheduling multiple CPU cores per GPU to perform data segmentation and reduction computations in parallel. The approach integrates multi-process CPU–GPU affinity binding with coordinated scheduling, enabling seamless incorporation into MPI and vendor-optimized communication libraries (NCCL and RCCL). Experimental evaluation on the Delta supercomputer demonstrates up to a 2.45× improvement in AllReduce throughput, with corresponding speedups of 1.77× and 1.71× over NCCL and RCCL, respectively. By jointly optimizing CPU and GPU utilization, this work unlocks latent potential in heterogeneous computing resources.
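The core idea of using multiple CPU cores per GPU can be illustrated with a small sketch: each rank's buffer is split into segments, and corresponding segments across ranks are reduced concurrently by separate CPU workers. This is an illustrative toy (using NumPy and a thread pool, since NumPy releases the GIL during elementwise sums), not the paper's MPI/NCCL implementation; the function names `reduce_segment` and `multicore_allreduce` are hypothetical.

```python
import numpy as np
from multiprocessing.pool import ThreadPool

def reduce_segment(segments):
    # Elementwise reduction of the same segment taken from every rank's buffer.
    # (Toy stand-in for the paper's per-core reduction work; not their API.)
    return np.sum(segments, axis=0)

def multicore_allreduce(buffers, n_cores=4):
    """Illustrative segmented reduction: split each rank's buffer into
    n_cores segments and reduce matching segments in parallel, one per core.
    In the real system these cores would also drive inter-node communication."""
    n_ranks = len(buffers)
    # Split every rank's buffer into n_cores segments.
    splits = [np.array_split(buf, n_cores) for buf in buffers]
    # Regroup so task i holds the i-th segment from every rank.
    tasks = [np.stack([splits[r][i] for r in range(n_ranks)])
             for i in range(n_cores)]
    with ThreadPool(n_cores) as pool:
        reduced = pool.map(reduce_segment, tasks)
    # Concatenating the reduced segments reconstructs the full result.
    return np.concatenate(reduced)
```

The payoff in the paper comes from the fact that each segment's reduction (and its communication) proceeds independently, so otherwise-idle cores contribute useful work instead of one core serializing the whole buffer.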

📝 Abstract
Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures consist of complex nodes, often containing $4$ GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations, extending lane-aware reductions to the GPUs, and notably using multiple CPU cores per GPU to accelerate these operations. These multi-CPU-accelerated GPU-aware lane all-reduces yield speedups of up to $2.45$x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer. Finally, the approach is extended to NVIDIA's and AMD's collective communication libraries, achieving speedups of up to $1.77$x and $1.71$x, respectively, across $2$ state-of-the-art supercomputers.
Problem

Research questions and friction points this paper is trying to address.

Optimizing all-reduce operations for heterogeneous GPU-CPU architectures
Reducing communication bottlenecks in large-scale deep learning systems
Leveraging idle CPU cores to accelerate GPU collective operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple CPU cores per GPU accelerate operations
Lane-aware reductions extended to GPUs
Optimized all-reduce for heterogeneous architectures