🤖 AI Summary
Existing HBD architectures for LLM training suffer from three critical bottlenecks: poor scalability, high cost, and weak fault tolerance. Switch-centric designs (e.g., NVL-72) incur prohibitive scaling costs; GPU-centric designs (e.g., TPUv3/Dojo) suffer severe fault propagation; and hybrid designs (e.g., TPUv4) still face a large fault explosion radius. This paper proposes InfinitePOD, a hyperscale HBD architecture tailored for LLM training that introduces the first transceiver-level optical circuit switching (OCS) technique. It enables per-node fault isolation, reconfigurable k-hop ring topologies, and coordinated scheduling between the HBD and the data center network (DCN). Leveraging silicon-photonics-integrated, low-power OCS transceivers and cross-layer communication optimization, the design reduces cost by 69% versus NVL-72, drives GPU waste toward zero (more than an order of magnitude lower than NVL-72 and TPUv4), eliminates cross-rack traffic when the node failure rate stays below 7%, and boosts Model FLOPs Utilization by 3.37x over NVIDIA DGX.
📝 Abstract
Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism such as Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfinitePOD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfinitePOD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm that maximizes GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfinitePOD achieves 31% of the cost of NVL-72, a near-zero GPU waste ratio (over an order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios are under 7%, and a 3.37x improvement in Model FLOPs Utilization compared to NVIDIA DGX (8 GPUs per node).
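To make the fault-isolation idea concrete, here is a toy sketch (not the paper's algorithm; the function name, data shapes, and the interpretation of "k-hop" are our assumptions based only on the abstract) of how a reconfigurable k-hop ring could route around failed nodes: each logical link is allowed to skip over physically adjacent failed nodes, up to a hop limit k set by the transceiver's reach, so a failure removes only that node while healthy GPUs keep a full ring.

```python
def build_khop_ring(num_nodes, failed, k):
    """Toy model of a reconfigurable k-hop ring (illustrative only).

    Healthy nodes are connected in a logical ring; each link may span up
    to k positions in the physical layout, bypassing failed nodes in
    between. Raises if a failure run is too long for the hop limit k.
    Returns the ring as a list of (src, dst) links.
    """
    healthy = [n for n in range(num_nodes) if n not in failed]
    if not healthy:
        return []
    ring = []
    for i, node in enumerate(healthy):
        nxt = healthy[(i + 1) % len(healthy)]
        hop = (nxt - node) % num_nodes  # physical distance of this link
        if hop > k:
            raise ValueError(
                f"link {node}->{nxt} spans {hop} positions, exceeds k={k}"
            )
        ring.append((node, nxt))
    return ring

# Node 3 fails in an 8-node domain with 2-hop transceiver reach:
# the link 2->4 bypasses it, and all 7 healthy nodes stay in one ring.
print(build_khop_ring(8, {3}, k=2))
```

The sketch illustrates why per-node isolation holds in this model: a single failure costs exactly one node, and only a run of more than k-1 consecutive failures would break the ring.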