Photonic Rails in ML Datacenters with Opus

📅 2026-02-13

📝 Abstract

Rail-optimized network fabrics have become the de facto scale-out fabric for large-scale ML training in datacenters. However, using high-radix electrical switches to provide all-to-all connectivity within each rail imposes substantial power and cost overheads. We propose rethinking the rail abstraction: we retain its communication semantics but realize it with optical circuit switches (OCSes). The key challenge is that an optical switch connects each port to only one other port at a time, which limits the traffic fan-out that ML workloads with hybrid parallelism require. We overcome this through parallelism-driven rail reconfiguration, which exploits the fact that the communication phases of different parallelism dimensions do not overlap. This approach time-multiplexes a single set of physical ports across circuit configurations, each tailored to one phase within a training iteration. We design and implement Opus, a control plane that orchestrates this in-job reconfiguration of photonic rails at parallelism phase boundaries, and evaluate it on a physical OCS testbed, on the Perlmutter supercomputer, and in simulation at up to 2,048 GPUs. Our results show that photonic rails can achieve over 23× network power reduction and 4× cost savings while incurring less than 6% training overhead at production-relevant OCS reconfiguration latencies.
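To make the time-multiplexing idea concrete, the following is a minimal sketch, not the Opus implementation. It assumes a hypothetical rail with four GPU ports, a 2x2 layout of data-parallel and pipeline-parallel groups, and ring-style communication inside each group. All function names and the group layout are illustrative assumptions. The sketch shows that each phase on its own is a valid one-to-one circuit configuration, while serving both phases at once would need a fan-out of two per port.

```python
from typing import Dict, List, Set

# A circuit configuration maps each transmitting port to exactly one receiving port.
Circuit = Dict[int, int]


def ring_circuits(groups: List[List[int]]) -> Circuit:
    """Build one circuit per port so that each group forms a ring (assumed pattern)."""
    circuits: Circuit = {}
    for group in groups:
        for i, src in enumerate(group):
            dst = group[(i + 1) % len(group)]
            if src != dst:
                circuits[src] = dst
    return circuits


def is_one_to_one(circuits: Circuit) -> bool:
    """An OCS can realize a configuration only if no receiving port is used twice."""
    return len(set(circuits.values())) == len(circuits)


def fan_out(configs: List[Circuit]) -> Dict[int, int]:
    """Count the distinct peers each port would need if all configs were active at once."""
    peers: Dict[int, Set[int]] = {}
    for cfg in configs:
        for src, dst in cfg.items():
            peers.setdefault(src, set()).add(dst)
    return {src: len(dsts) for src, dsts in peers.items()}


# Hypothetical layout: four ports on one rail, grouped differently per parallelism.
dp_groups = [[0, 2], [1, 3]]  # data-parallel groups (assumption)
pp_groups = [[0, 1], [2, 3]]  # pipeline-parallel groups (assumption)

dp_cfg = ring_circuits(dp_groups)
pp_cfg = ring_circuits(pp_groups)

# One configuration per phase; the switch is reconfigured at each phase boundary.
schedule = [("pipeline", pp_cfg), ("data", dp_cfg)]

for phase, cfg in schedule:
    print(phase, cfg, "valid:", is_one_to_one(cfg))
print("fan-out if both were needed at once:", fan_out([dp_cfg, pp_cfg]))
```

Because the two phases do not communicate at the same time, the same four ports can serve both by switching between `pp_cfg` and `dp_cfg` at phase boundaries, at the cost of one reconfiguration delay per switch.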

Problem

Research questions and friction points this paper is trying to address.

photonic rails

ML datacenters

network power

cost efficiency

optical circuit switches

Innovation

Methods, ideas, or system contributions that make the work stand out.

photonic rails

optical circuit switching

parallelism-driven reconfiguration

ML datacenter networking

Opus
