🤖 AI Summary
This paper addresses fault tolerance for collective communication in distributed machine learning over public networks. The authors propose PCCL, a WAN-optimized, fault-tolerant collective communication library. Methodologically, PCCL introduces: (i) a dynamic-membership programming model that enables runtime node join/leave and failure recovery; (ii) a deterministic state machine, coupled with asynchronous DiLoCo optimization, that guarantees bit-parity of shared state while hiding communication latency; and (iii) multi-connection TCP scheduling and quantized communication for efficient intercontinental link utilization (up to 45 Gbit/s across Europe). PCCL is compatible with PyTorch and Fully Sharded Data Parallel (FSDP). Extensive stress testing across all major operating systems demonstrates reliable operation under frequent node churn, high bandwidth utilization, substantially reduced collective communication frequency and volume, and high-throughput, low-latency all-reduce operations.
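To make the claimed communication savings concrete, here is a back-of-the-envelope sketch (not part of PCCL) comparing a naive per-step fp32 all-reduce with a DiLoCo-style scheme that synchronizes only every H inner steps and quantizes to int8. The model size, the sync interval H=500, and the quantization width are assumed values for illustration only:

```python
def bytes_per_step(num_params: int, bytes_per_elem: float, sync_every: int) -> float:
    """Average payload bytes per peer per training step.

    A ring all-reduce moves roughly 2x the payload on the wire, but that
    constant cancels in the ratio, so it is ignored here.
    """
    return num_params * bytes_per_elem / sync_every

params = 1_000_000_000                        # 1B-parameter model (assumed)
naive  = bytes_per_step(params, 4.0, 1)       # fp32, all-reduce every step
diloco = bytes_per_step(params, 1.0, 500)     # int8, sync every 500 steps (assumed H)

print(f"naive:     {naive / 1e9:.1f} GB/step")   # 4.0 GB/step
print(f"diloco:    {diloco / 1e6:.1f} MB/step")  # 2.0 MB/step
print(f"reduction: {naive / diloco:.0f}x")       # 2000x
```

Under these assumed values, combining the lower sync frequency (500x) with int8 quantization (4x) yields a 2000x reduction in average communication volume, which is the kind of saving that makes training over WAN links practical.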
📝 Abstract
This report presents the Prime Collective Communications Library (PCCL), a novel fault-tolerant collective communication library designed for distributed ML workloads over the public internet. PCCL introduces a new programming model that enables dynamic peer joining and failure recovery. The library implements efficient collective operations such as all-reduce while providing robust fault tolerance mechanisms that allow the system to continue operating even when peers fail or join during ongoing operations. We demonstrate that PCCL's design enables practical solutions to dynamic membership challenges in workloads with repeated operations and deterministic state advancement. Our implementation passes extensive stress tests across all major operating systems, showing reliable operation even under rapid peer churn and concurrent collective operations. By dispatching across multiple connections, PCCL can efficiently utilize cross-continental long-fat-pipe TCP WAN links, achieving in our experiments up to 45 Gbit/s of bandwidth utilization across Europe and 25 Gbit/s between North America and Europe. PCCL's architecture enables easy implementation of distributed low-communication optimization strategies like DiLoCo, which significantly reduce communication frequency. Combined with quantization, this leads to a significant reduction in the bandwidth required for distributed training workloads. PCCL also supports concurrent collective operations, enabling strategies such as async DiLoCo, which can completely hide communication overhead through one-step-delayed parameter updates. PCCL maintains exact bit-parity of the shared state across peers under both graceful and abrupt peer churn. PCCL exposes a C99 API; Python bindings compatible with PyTorch and FSDP are also available. PCCL is released under the open-source MIT license.
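The one-step-delayed update pattern behind async DiLoCo can be sketched with plain Python threads: while the current outer step's local optimization runs, the previous step's parameter delta is still being reduced in the background, so communication overlaps compute. This is a hedged illustration of the scheduling idea only; `all_reduce` is a single-peer placeholder, not PCCL's actual API, and the "inner optimization" is a stand-in:

```python
import threading

def all_reduce(delta: float) -> float:
    """Placeholder for an averaging collective across peers (single peer here)."""
    return delta

def train(num_outer_steps: int = 3) -> float:
    params = 0.0
    pending = None  # (thread, result box) for the in-flight reduction

    for _ in range(num_outer_steps):
        # Inner optimization proceeds while the previous delta is still reducing.
        local = params + 1.0             # stand-in for H local optimizer steps
        delta = local - params           # pseudo-gradient to be reduced

        if pending is not None:
            thread, box = pending
            thread.join()                # communication was hidden behind compute
            params += box[0]             # apply the one-step-delayed update

        box = [0.0]
        thread = threading.Thread(
            target=lambda d=delta, b=box: b.__setitem__(0, all_reduce(d)))
        thread.start()
        pending = (thread, box)

    thread, box = pending                # drain the final in-flight reduction
    thread.join()
    params += box[0]
    return params

print(train())  # with the single-peer placeholder: 3.0
```

Each outer step launches its reduction and only waits on the one launched a step earlier, so as long as the reduction finishes within one outer step of compute, the communication cost is fully hidden, at the price of applying each update one step late.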