🤖 AI Summary
Existing HBD architectures for LLM training suffer from three critical bottlenecks: poor scalability, high cost, and weak fault tolerance. Switch-centric designs (e.g., NVL-72) incur prohibitive scaling costs; GPU-centric designs (e.g., TPUv3/Dojo) suffer severe fault propagation; and hybrid designs (e.g., TPUv4) still face a large fault explosion radius. This paper proposes InfinitePOD, a hyperscale HBD architecture tailored for LLM training that introduces the first transceiver-level optical circuit switching (OCS) technique. It enables per-node fault isolation, reconfigurable k-hop ring topologies, and coordinated scheduling between the HBD and the data center network (DCN). Leveraging silicon-photonics-integrated, low-power OCS transceivers and cross-layer communication optimization, the design reduces cost by 69% versus NVL-72, drives GPU waste toward zero (more than an order of magnitude lower than NVL-72 and TPUv4), eliminates cross-rack traffic when the node failure rate stays below 7%, and boosts Model FLOPs Utilization by 3.37x over NVIDIA DGX.
📝 Abstract
Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism such as Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfinitePOD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfinitePOD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm that maximizes GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfinitePOD achieves 31% of the cost of NVL-72, a near-zero GPU waste ratio (over an order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios are under 7%, and a 3.37x improvement in Model FLOPs Utilization compared to NVIDIA DGX (8 GPUs per node).
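To make the fault-isolation idea concrete, here is a toy sketch (not the paper's algorithm; the function name, data shapes, and the interpretation of "k-hop" are our assumptions based only on the abstract) of how a reconfigurable k-hop ring could route around failed nodes: each logical link is allowed to skip over physically adjacent failed nodes, up to a hop limit k set by the transceiver's reach, so a failure removes only that node while healthy GPUs keep a full ring.

```python
def build_khop_ring(num_nodes, failed, k):
    """Toy model of a reconfigurable k-hop ring (illustrative only).

    Healthy nodes are connected in a logical ring; each link may span up
    to k positions in the physical layout, bypassing failed nodes in
    between. Raises if a failure run is too long for the hop limit k.
    Returns the ring as a list of (src, dst) links.
    """
    healthy = [n for n in range(num_nodes) if n not in failed]
    if not healthy:
        return []
    ring = []
    for i, node in enumerate(healthy):
        nxt = healthy[(i + 1) % len(healthy)]
        hop = (nxt - node) % num_nodes  # physical distance of this link
        if hop > k:
            raise ValueError(
                f"link {node}->{nxt} spans {hop} positions, exceeds k={k}"
            )
        ring.append((node, nxt))
    return ring

# Node 3 fails in an 8-node domain with 2-hop transceiver reach:
# the link 2->4 bypasses it, and all 7 healthy nodes stay in one ring.
print(build_khop_ring(8, {3}, k=2))
```

The sketch illustrates why per-node isolation holds in this model: a single failure costs exactly one node, and only a run of more than k-1 consecutive failures would break the ring.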