Short-circuiting Rings for Low-Latency AllReduce

📅 2025-10-03
🤖 AI Summary
Conventional wisdom holds that Ring-AllReduce excels for large messages, while Recursive Doubling (RD) is preferable for small messages due to its logarithmic step count—yet this assumption neglects propagation latency and link congestion in practical optical interconnects. Method: This work introduces the first optical interconnect optimization framework supporting *intra-collective* dynamic topology reconfiguration, leveraging programmable optical switches for low-overhead circuit switching and a novel heuristic path scheduling algorithm that jointly models propagation delay, bandwidth constraints, and reconfiguration overhead. Contribution/Results: Experiments demonstrate that dynamically reconfigured RD achieves up to 1.8× speedup over static Ring for short messages—empirically validating, for the first time, the feasibility and efficacy of “short-path” recursive collective communication in realistic optical networks.

📝 Abstract
Efficient collective communication is critical for many distributed ML and HPC applications. In this context, it is widely believed that the Ring algorithm for the AllReduce collective communication operation is optimal only for large messages, while Recursive Doubling is preferable for small ones due to its logarithmic number of steps compared to the linear number for Ring. In this paper, we challenge this long-held assumption and show that the Ring algorithm can remain optimal even for short messages in ring-based GPU-to-GPU topologies, once realistic propagation delays and link capacity constraints are accounted for. We find that the total propagation delay for both Ring and Recursive Doubling essentially sums to the same value, but the latter incurs significantly higher congestion due to longer hop counts, leading to increased completion times. This surprising result motivates our case for in-collective adaptive topologies, particularly in the context of emerging photonic interconnects, which can break through the limitations of static topology designs at the collective communication granularity. We design a simple and fast heuristic for circuit-switching that enables Recursive Doubling to exploit dynamically reconfigurable photonic paths, carefully balancing reconfiguration delays, propagation latencies, and link congestion to minimize overall completion time. Our preliminary evaluations, using realistic reconfiguration delays, show that our circuit-switching schedules enable faster completion times for Recursive Doubling, even compared to Ring AllReduce on static ring topologies. We conclude by highlighting key challenges and future research directions for realizing practical, in-collective photonic switching.
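
The hop-count and congestion argument in the abstract can be illustrated with a small toy model. The sketch below is not the paper's model or heuristic. It assumes a bidirectional ring of `n` nodes (`n` a power of two), unit delay per hop, shortest-path routing, and a hypothetical routing rule in which the node with bit `k` clear sends clockwise and its partner sends counter-clockwise. The function names are illustrative only.

```python
from collections import Counter


def ring_allreduce_steps(n):
    """Number of neighbor-to-neighbor steps in Ring AllReduce:
    (n - 1) reduce-scatter steps plus (n - 1) all-gather steps."""
    return 2 * (n - 1)


def rd_step_profile(n):
    """For each Recursive Doubling step on an n-node bidirectional ring,
    return (hops_per_message, max_directed_link_load).

    Assumption: in step k, node i exchanges with partner i XOR 2^k.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    profile = []
    for k in range(n.bit_length() - 1):
        dist = 1 << k  # partner is 2^k positions away, never more than n/2
        load = Counter()
        for src in range(n):
            # Assumed routing: bit k clear -> clockwise, bit k set -> counter-clockwise.
            direction = 1 if not (src >> k) & 1 else -1
            node = src
            for _ in range(dist):
                nxt = (node + direction) % n
                load[(node, nxt)] += 1  # count messages per directed link
                node = nxt
        profile.append((dist, max(load.values())))
    return profile


if __name__ == "__main__":
    n = 8
    profile = rd_step_profile(n)
    print("RD (hops, max link load) per step:", profile)
    print("RD total hops:", sum(h for h, _ in profile))
    print("Ring steps (1 hop, load 1 each):", ring_allreduce_steps(n))
```

In this toy model, Recursive Doubling on 8 nodes takes 3 steps with hop distances 1, 2, and 4, so its accumulated hop delay is n - 1 and grows linearly with n, just as Ring's 2(n - 1) single-hop steps do. The maximum load on a link also doubles at each Recursive Doubling step, while Ring keeps one message per link. The paper's exact accounting of delay and congestion may differ from this simplification.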
Problem

Research questions and friction points this paper is trying to address.

Challenging the assumption that Ring AllReduce is suboptimal for small messages
Analyzing propagation delays and congestion in ring-based GPU topologies
Proposing dynamic photonic interconnects to optimize collective communication performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Showing that the Ring algorithm can remain optimal for short messages on ring-based topologies
Circuit-switching heuristic for photonic paths
Balancing reconfiguration delays and link congestion