PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Collective communication is often a performance bottleneck in distributed machine learning because standard collective algorithms ignore the physical network topology and the structure of process groups. This work proposes PCCL, a scalable and general framework for synthesizing collective communication algorithms that incorporates process-group awareness into algorithm generation and supports arbitrary communication patterns. By integrating topology-aware modeling with optimized search strategies, the framework automatically generates communication algorithms tailored to the actual process groups and the underlying network topology. The authors report that it synthesizes an All-to-All algorithm for a 512-NPU system in 11.68 minutes and produces near-optimal algorithms even when only a subset of devices participates.

📝 Abstract

Distributed machine learning has become increasingly important due to the massive scale of generative models. Both model parameters and data are distributed across many compute devices, which requires frequent collective communication to synchronize activations and parameter updates. This collective communication has become a major bottleneck. Although the performance of a collective algorithm depends on the physical network topology, the baseline algorithms in collective communication libraries are largely topology-agnostic. Collective algorithm synthesizers address this inefficiency by automatically generating topology-aware collective algorithms. However, prior work has largely overlooked that collective communication typically occurs only among a subset of devices, known as a process group. In addition, most existing synthesizers support only a limited range of target collective patterns. We propose PCCL, a scalable and generic framework for synthesizing topology-aware collective algorithms. PCCL is process group-aware and can generate near-optimal collective algorithms even when only a subset of devices participates in a collective operation. PCCL can synthesize arbitrary collective patterns; for example, it synthesizes an All-to-All algorithm for 512 NPUs in 11.68 minutes.
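To make the term "process group" concrete, here is a minimal Python sketch. It is not PCCL's algorithm and it uses no PCCL API. The 8-device ring, the helper names (`ring`, `induced`, `shortest_hops`), and the choice of group are all assumptions made for illustration. The sketch shows one reason the process group and the full topology may need to be modeled together: two group members can be reachable only through devices outside the group.

```python
from collections import deque

def ring(n):
    """Hypothetical topology: n devices connected in an undirected ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def induced(topology, group):
    """Keep only the links whose two endpoints are both in the process group."""
    members = set(group)
    return {d: [p for p in topology[d] if p in members] for d in members}

def shortest_hops(topology, src, dst):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in topology[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

topology = ring(8)
process_group = [0, 2, 4, 6]  # assumed subset of devices taking part in the collective

# Using the full topology, device 0 reaches device 2 by relaying through device 1.
full = shortest_hops(topology, 0, 2)

# Using only links between group members, there is no path at all.
restricted = shortest_hops(induced(topology, process_group), 0, 2)

print(full, restricted)
```

In this toy setup no two group members share a direct link, so a view that drops non-member devices finds no route between them, while a view that keeps the whole topology does. Whether and how PCCL relays traffic through non-member devices is not stated in the abstract.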

Problem

Research questions and friction points this paper is trying to address.

collective communication

process group

topology-aware

algorithm synthesis

distributed machine learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

process group-aware

collective algorithm synthesis

topology-aware communication