AI Summary
This work addresses load imbalance in communication-intensive parallel applications with irregular and time-varying workloads, with particular attention to the cross-node communication overhead among persistently communicating objects. The authors propose a new type of communication-aware, distributed diffusion-based load balancing that incorporates the application's communication graph into the diffusion process, so that computational load is balanced while inter-node communication is reduced. A variant is also introduced for cases where communication patterns are not readily available; it infers the communication structure online. The approach is evaluated against related load balancing strategies in simulation and on a Particle-in-Cell benchmark, on up to 8 nodes of the NERSC Perlmutter system.
Abstract
Parallel applications with irregular and time-varying workloads often suffer from load imbalance. Dynamic load balancing techniques address this challenge by redistributing work during execution. We present a new type of distributed diffusion-based load balancing targeted at communication-intensive applications with persistently communicating objects. Leveraging the application's communication graph, our strategy reduces across-node communication while simultaneously distributing load effectively. We also propose an algorithmic variant for cases where the communication patterns are not readily available. We explore optimizations to our algorithm and compare it with related load balancing strategies, both in simulation and on a Particle-in-Cell benchmark on up to 8 nodes of Perlmutter at NERSC.
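To make the general idea concrete, the following is a minimal sketch of one communication-aware diffusion step. It is not the authors' algorithm: it assumes a textbook first-order diffusion rule (a heavier node sheds a fraction `alpha` of the load difference to a lighter neighbor) combined with a greedy choice of which objects to migrate, based on how much communication each object has with the destination versus the source. All names (`diffusion_step`, `comm_gain`, `alpha`, the toy chain of six objects) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def node_loads(assign, load):
    """Sum object loads per node."""
    totals = defaultdict(float)
    for obj, node in assign.items():
        totals[node] += load[obj]
    return totals

def cut_weight(assign, comm):
    """Total weight of communication edges whose endpoints sit on different nodes."""
    return sum(w for (u, v), w in comm.items() if assign[u] != assign[v])

def comm_gain(obj, src, dst, assign, comm):
    """Traffic with objects on dst minus traffic with objects on src.
    A higher value means migrating obj to dst removes more cross-node traffic."""
    gain = 0.0
    for (u, v), w in comm.items():
        if obj not in (u, v):
            continue
        other = v if u == obj else u
        if assign[other] == dst:
            gain += w
        elif assign[other] == src:
            gain -= w
    return gain

def diffusion_step(assign, load, comm, node_edges, alpha=0.5):
    """One sweep over neighboring node pairs (assumed rule, not from the paper):
    the heavier node sheds up to alpha * (load difference) to the lighter one,
    picking the objects with the best communication gain first."""
    assign = dict(assign)  # work on a copy
    for a, b in node_edges:
        totals = node_loads(assign, load)
        src, dst = (a, b) if totals[a] >= totals[b] else (b, a)
        budget = alpha * (totals[src] - totals[dst])
        while True:
            movable = [o for o, n in assign.items()
                       if n == src and load[o] <= budget]
            if not movable:
                break
            # Gains are recomputed after every move, since each migration
            # changes which neighbors are local and which are remote.
            best = max(movable, key=lambda o: comm_gain(o, src, dst, assign, comm))
            assign[best] = dst
            budget -= load[best]
    return assign

# Toy input: six unit-load objects communicating in a chain 0-1-2-3-4-5,
# five of them on node 0 and one on node 1.
load = {i: 1.0 for i in range(6)}
comm = {(i, i + 1): 1.0 for i in range(5)}
start = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1}
balanced = diffusion_step(start, load, comm, node_edges=[(0, 1)])
print(balanced)                     # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(cut_weight(balanced, comm))   # 1.0
```

In this toy case the step moves objects 4 and 3, the two closest to the object already on node 1, so both nodes end with load 3 and only one chain edge crosses nodes. A selection that ignored communication could move objects 0 and 1 instead, which balances load equally well but cuts two edges.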