🤖 AI Summary
This work addresses the lack of a high-throughput, general-purpose solution for large-scale AI training that simultaneously respects physical constraints and optimizes network topology, routing, and collective communication. The authors propose TONS, a framework that enables automated throughput-optimized network synthesis for AI supercomputers. TONS formulates topology synthesis as a linear optimization problem and scales to thousands of nodes by integrating theoretical insights with heuristic methods. It also introduces a deadlock-free routing mechanism supporting limited virtual channels and fault tolerance in optical switching. Under realistic deployment constraints, TONS achieves geometric mean speedups of 2.1× and 1.6× over the best TPU v4/5p torus variants for uniform random and all-to-all communication patterns, respectively.
📝 Abstract
Datacenter network design plays a critical role in AI training by supporting scaling to thousands of accelerators. An open problem, designing near-optimal, throughput-oriented network topology, routing, and collectives, has not been solved at scale and with broad applicability to physical and implementation constraints. We address this problem with a compelling use case: Google's TPU v4/5p supercomputer, whose topology can be reconfigured to achieve higher all-to-all throughput in support of large, parallelized AI training. We show that the existing TPU networks leave terabytes per second of throughput on the table, and we fill that gap. This paper presents Throughput Optimized Networks at Scale (TONS), an automated network synthesis framework that meets the high-throughput demands of modern computing. TONS formulates topology synthesis as a linear optimization problem that maximizes a throughput-centric proxy metric, using theory and heuristics to scale to thousands of nodes. We further introduce a deadlock-free routing scheme compatible with limited virtual channels and optical switch faults, enabling the synthesized topologies to realize their predicted throughput gains in simulation. Under uniform random and all-to-all traffic, TONS networks achieve geometric mean speedups of 2.1x and 1.6x, respectively, over the best TPU v4/5p torus variants.
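For readers unfamiliar with the metric, the following is a minimal sketch of how a geometric mean speedup over several configurations can be computed. It is not the paper's evaluation code. The function name `geometric_mean_speedup`, the definition of per-configuration speedup as a ratio of completion times, and the sample values are all assumptions made for illustration.

```python
import math


def geometric_mean_speedup(baseline_times, candidate_times):
    """Geometric mean of per-configuration speedups.

    Assumption: speedup for one configuration is baseline_time / candidate_time,
    so values above 1.0 mean the candidate network is faster.
    """
    if len(baseline_times) != len(candidate_times) or not baseline_times:
        raise ValueError("inputs must be non-empty and of equal length")

    log_sum = 0.0
    for base, cand in zip(baseline_times, candidate_times):
        if base <= 0 or cand <= 0:
            raise ValueError("times must be positive")
        # Summing logs avoids overflow when many ratios are multiplied.
        log_sum += math.log(base / cand)

    return math.exp(log_sum / len(baseline_times))


# Hypothetical completion times (arbitrary units), not taken from the paper.
baseline = [4.0, 9.0]
candidate = [1.0, 1.0]
print(geometric_mean_speedup(baseline, candidate))
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a single configuration with a very large gain cannot dominate the result the way it would in an arithmetic mean.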