🤖 AI Summary
In large-scale GPU clusters, network failures severely degrade AllReduce communication performance. Existing fault-tolerant methods keep the degraded node on the critical path, which slows the entire training job. This work derives the first information-theoretic lower bound on AllReduce completion time under asymmetric bandwidth and proposes OptCC, a four-stage pipelined AllReduce algorithm that approaches this bound. OptCC integrates ring topology, bandwidth-aware routing, and pipelined scheduling. In experiments on the SimAI platform with up to 50% bandwidth loss, it incurs only 2–6% overhead relative to fault-free NCCL, whereas state-of-the-art approaches incur up to 57% overhead.
📝 Abstract
Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2–6% of NCCL's fault-free ring performance, whereas the state of the art incurs up to 57% overhead.
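To illustrate why a degraded server on the critical path slows the whole collective, the sketch below uses the textbook bandwidth-only cost model of ring AllReduce. It is an assumption-laden illustration, not the paper's lower bound and not OptCC: latency and intra-node links are ignored, every step is assumed to wait for the slowest link, and the function names `ring_allreduce_time` and `overhead_vs_healthy` are hypothetical.

```python
def ring_allreduce_time(message_bytes, link_bandwidths):
    """Bandwidth-only completion time of a ring AllReduce (simplified model).

    link_bandwidths[i] is the outgoing bandwidth of rank i in bytes/second.
    Assumption: all ranks send one chunk per step in lockstep, so each step
    lasts as long as the slowest link needs to move one chunk.
    """
    p = len(link_bandwidths)
    if p < 2:
        return 0.0  # a single rank has nothing to exchange
    chunk = message_bytes / p          # the message is split into p chunks
    steps = 2 * (p - 1)                # (p-1) reduce-scatter + (p-1) all-gather
    step_time = chunk / min(link_bandwidths)
    return steps * step_time


def overhead_vs_healthy(message_bytes, p, full_bandwidth, straggler_fraction):
    """Relative slowdown when one rank keeps only a fraction of its bandwidth."""
    healthy = ring_allreduce_time(message_bytes, [full_bandwidth] * p)
    degraded_links = [full_bandwidth] * (p - 1) + [full_bandwidth * straggler_fraction]
    degraded = ring_allreduce_time(message_bytes, degraded_links)
    return degraded / healthy - 1.0


if __name__ == "__main__":
    # Hypothetical numbers: 8 ranks, 1 GB message, 50 GB/s links,
    # one rank reduced to half of its bandwidth.
    slowdown = overhead_vs_healthy(1e9, 8, 50e9, 0.5)
    print(f"relative overhead in this model: {slowdown:.0%}")
```

In this simplified model the slowdown depends only on the straggler's remaining bandwidth fraction, not on p, because every one of the 2(p-1) steps is paced by the slowest link. The measured overheads quoted in the abstract come from SimAI experiments and are not reproduced by this sketch.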