InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

📅 2025-02-06
🤖 AI Summary
Existing HBD architectures for LLM training suffer from three critical bottlenecks: poor scalability, high cost, and weak fault tolerance. Switch-centric designs (e.g., NVL-72) incur prohibitive scaling costs; GPU-centric designs (e.g., TPUv3/Dojo) experience severe fault propagation; and hybrid designs (e.g., TPUv4) still have a large fault-explosion radius. This paper proposes a novel hyperscale HBD architecture tailored for LLM training, introducing the first transceiver-level optical circuit switching (OCS) technique. It enables per-node fault isolation, k-hop reconfigurable ring topologies, and coordinated scheduling between the HBD and the data center network (DCN). Leveraging silicon photonics–integrated, low-power OCS transceivers and cross-layer communication optimization, the design reduces cost by 69% versus NVL-72, drives the GPU waste ratio toward zero (a >10× reduction), keeps cross-ToR traffic near zero while the node failure rate remains below 7%, and boosts Model FLOPs Utilization by 3.37×.

📝 Abstract
Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfinitePOD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfinitePOD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm that maximizes GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfinitePOD achieves 31% of the cost of NVL-72, a near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when the node fault ratio is under 7%, and a 3.37x improvement in Model FLOPs Utilization compared to NVIDIA DGX (8 GPUs per node).
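To make the central mechanism concrete, here is a toy sketch (not the paper's implementation) of the idea behind the reconfigurable k-hop ring: because each node's OCS transceivers can reach any of the next k nodes, the ring can be re-formed to simply skip failed nodes, as long as no more than k-1 consecutive nodes fail. The function name and interface below are illustrative assumptions.

```python
def build_ring(num_nodes: int, k: int, failed: set[int]) -> list[int]:
    """Return the ring order over healthy nodes, or raise ValueError if a
    run of failed nodes is too wide for k-hop links to bridge."""
    healthy = [n for n in range(num_nodes) if n not in failed]
    if not healthy:
        raise ValueError("no healthy nodes")
    ring = []
    for i, node in enumerate(healthy):
        nxt = healthy[(i + 1) % len(healthy)]
        hop = (nxt - node) % num_nodes  # physical distance along the ring
        if hop > k:
            raise ValueError(f"{hop - 1} consecutive failures exceed k-1 = {k - 1}")
        ring.append(node)
    return ring

# With k = 3, up to two consecutive failed nodes can be bypassed:
print(build_ring(8, k=3, failed={2, 3}))  # [0, 1, 4, 5, 6, 7]
```

This captures why the fault-explosion radius is a single node: a failure only removes itself from the ring, while every fault-free GPU keeps its full ring bandwidth.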
Problem

Research questions and friction points this paper is trying to address.

Scalability and cost limits of existing High-Bandwidth Domain architectures.
Fault resilience and isolation in large-scale LLM training.
Exploiting Optical Circuit Switching for full GPU bandwidth utilization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical Circuit Switching transceivers
Reconfigurable k-hop ring topology
HBD-DCN orchestration algorithm
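The orchestration goal can be illustrated with a hypothetical greedy scheduler (the paper's actual algorithm is not reproduced here): pack fixed-size parallelism groups from healthy nodes, preferring nodes under the same ToR switch so that group traffic stays inside the HBD rather than crossing the datacenter network. All names below are illustrative assumptions.

```python
from collections import defaultdict

def schedule_groups(nodes, group_size):
    """nodes: list of (node_id, tor_id, healthy) tuples.
    Returns (groups, cross_tor_groups). Greedy: fill groups rack-by-rack,
    so only leftover nodes from partial racks spill into cross-ToR groups."""
    by_tor = defaultdict(list)
    for node_id, tor_id, healthy in nodes:
        if healthy:  # failed nodes are excluded, not whole racks
            by_tor[tor_id].append(node_id)
    groups, leftovers, cross_tor = [], [], 0
    for tor_id in sorted(by_tor):
        rack = by_tor[tor_id]
        while len(rack) >= group_size:
            groups.append([rack.pop(0) for _ in range(group_size)])
        leftovers.extend(rack)
    while len(leftovers) >= group_size:  # remaining groups must span ToRs
        groups.append([leftovers.pop(0) for _ in range(group_size)])
        cross_tor += 1
    return groups, cross_tor

# Two 4-node racks, node 5 failed, groups of 2: all groups stay intra-rack.
print(schedule_groups([(i, i // 4, i != 5) for i in range(8)], 2))
```

At low fault ratios, partial racks are rare, which mirrors the paper's result of near-zero cross-ToR traffic when node faults stay under 7%.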
Chenchen Shou
Peking University, StepFun, Lightelligence Pte. Ltd.
Guyue Liu
Peking University
Hao Nie
StepFun
Huaiyu Meng
Lightelligence Pte. Ltd.
Yu Zhou
StepFun
Yinmin Jiang
Unaffiliated
Wenqing Lv
Lightelligence Pte. Ltd.
Yelong Xu
Lightelligence Pte. Ltd.
Yuanwei Lu
StepFun
Zhang Chen
Lightelligence Pte. Ltd.
Yanbo Yu
StepFun
Yichen Shen
Lightelligence Pte. Ltd.
Yibo Zhu
StepFun
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning · Foundation Models