🤖 AI Summary
In modern distributed machine learning, the practical performance of collective communication is severely constrained by network congestion and multi-hop latency in GPU clusters, leading to substantial gaps between theoretical predictions and empirical measurements. To address this, we propose the first hardware-agnostic photonic collective communication optimization framework. It dynamically reconfigures network topology via photonic circuit switching, establishes contention-free direct paths by adapting to communication patterns, and employs a latency-benefit trade-off algorithm to adaptively schedule arbitrary collective operations alongside topology configurations. This approach overcomes the limitations of static interconnects by enabling real-time alignment of communication paths with computational requirements. Evaluated on a 128-GPU cluster, our framework achieves up to 3× higher communication throughput and a 1.3× improvement in end-to-end training throughput. It establishes a new paradigm for large-scale AI training: efficient, general-purpose, and scalable photonic interconnection.
📝 Abstract
Modern distributed ML suffers from a fundamental gap between the theoretical and realized performance of collective communication algorithms due to congestion and hop-count induced dilation in practical GPU clusters. We present PCCL, a Photonic Collective Communication Library that reconfigures the network topology to match the communication patterns of collective algorithms, thereby eliminating congestion and dilation by creating direct, contention-free circuits between communicating GPUs. Unlike prior approaches that synthesize algorithms for specific network topologies and collectives, PCCL generalizes to any collective primitive and any topology by adapting the network to match each algorithm's communication pattern. PCCL's key innovation lies in its hardware-agnostic optimization framework that intelligently decides when to reconfigure based on the trade-off between network reconfiguration delay and congestion/dilation costs, making it practical across different optical hardware with varying switching speeds. Our evaluation demonstrates that PCCL achieves up to 3X speedup over state-of-the-art algorithms on 128 GPUs across various workloads, buffer sizes, and topologies, translating to a 1.3X speedup in end-to-end training throughput.