🤖 AI Summary
Emerging trillion-parameter LLMs (e.g., GPT-4) impose unprecedented demands on datacenter infrastructure, exposing critical bottlenecks in computation, memory bandwidth, and network interconnect.
Method: We propose a hardware–software co-design framework for LLM-scale datacenters, featuring the novel FullFlat all-to-all optical interconnect architecture; an end-to-end LLM performance modeling tool with <10% error; and integrated modeling of FLOPS–HBM–network trade-offs, hardware-accelerated collective communication, joint MoE/dense Transformer evaluation, and scale-out domain analysis.
Contribution/Results: Our framework quantifies the intrinsic relationship between Model FLOPS Utilization (MFU) and system parameters—the first such characterization. Experiments demonstrate substantial MFU and training throughput improvements; validate key gains from compute–communication overlap, high-capacity HBM, and wide scale-out domains; and deliver the first deployable, system-level design roadmap for trillion-parameter model training infrastructures.
📝 Abstract
The explosive growth of Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture-of-experts) and dense Transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = model FLOPS per token × observed tokens per second / peak hardware FLOPS) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
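The MFU definition in the abstract can be sketched as a one-line computation. The function below is a minimal illustration, not the paper's modeling tool; all numeric values (parameter count, tokens/s, peak cluster FLOPS, and the common ~6 FLOPS-per-parameter-per-token training estimate) are assumptions chosen for the example.

```python
def model_flops_utilization(flops_per_token: float,
                            observed_tokens_per_sec: float,
                            peak_flops: float) -> float:
    """MFU = model FLOPS per token x observed tokens/sec / peak hardware FLOPS.

    Returns the fraction of the hardware's peak FLOPS the model actually uses.
    """
    return (flops_per_token * observed_tokens_per_sec) / peak_flops

# Illustrative (made-up) numbers: a 1.8T-parameter dense model, using the
# common ~6 * N FLOPS-per-token training estimate, observed at 37,000 tokens/s
# on a cluster with 1e18 peak FLOPS.
flops_per_token = 6 * 1.8e12          # ~1.08e13 FLOPS per token
mfu = model_flops_utilization(flops_per_token, 37_000, 1e18)
print(f"MFU = {mfu:.1%}")             # ~40% under these assumed numbers
```

Reading the formula this way makes the sensitivity analysis concrete: any design change that raises observed tokens per second on fixed hardware (e.g., compute–communication overlap or a wider scale-out domain) raises MFU proportionally.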