Scaling Intelligence: Designing Data Centers for Next-Gen Language Models

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Emerging trillion-parameter LLMs (e.g., GPT-4) impose unprecedented demands on data center infrastructure, exposing critical bottlenecks in computation, memory bandwidth, and network interconnect. Method: We propose a hardware–software co-design framework for LLM-scale data centers, featuring the novel FullFlat all-to-all optical interconnect architecture; an end-to-end LLM performance modeling tool with <10% error; and integrated modeling of FLOPS–HBM–network trade-offs, hardware-accelerated collective communication, joint MoE/dense Transformer evaluation, and scale-out domain analysis. Contribution/Results: Our framework quantifies the intrinsic relationship between Model FLOPS Utilization (MFU) and system parameters, the first such characterization. Experiments demonstrate substantial MFU and training-throughput improvements; validate key gains from compute–communication overlap, high-capacity HBM, and wide scale-out domains; and deliver the first deployable, system-level design roadmap for trillion-parameter model training infrastructure.

📝 Abstract
The explosive growth of Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture-of-experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = model FLOPs per token × observed tokens per second / peak FLOPs of the hardware) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
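The MFU definition in the abstract can be made concrete with a short sketch. The function below implements the stated formula directly; the specific numbers in the example (FLOPs per token, throughput, and cluster peak FLOPS) are hypothetical placeholders, not figures from the paper.

```python
def model_flops_utilization(model_flops_per_token: float,
                            observed_tokens_per_sec: float,
                            peak_flops: float) -> float:
    """MFU = (model FLOPs per token * observed tokens/sec) / peak hardware FLOPs.

    `peak_flops` is the aggregate peak of all accelerators in the job,
    so the result is a dimensionless utilization fraction in [0, 1].
    """
    return (model_flops_per_token * observed_tokens_per_sec) / peak_flops


# Hypothetical example: a dense model costing ~6 FLOPs per parameter per
# training token (70B parameters -> 4.2e11 FLOPs/token), observed at
# 2.0e4 tokens/sec on a cluster with 1.6e16 aggregate peak FLOPS.
mfu = model_flops_utilization(6 * 70e9, 2.0e4, 1.6e16)
print(f"MFU = {mfu:.1%}")  # → MFU = 52.5%
```

Because all three inputs are directly measurable, MFU gives a hardware-normalized way to compare the system design choices (network topology, HBM capacity, scale-out width) studied in the paper.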
Problem

Research questions and friction points this paper is trying to address.

Redesign data centers for scalable, efficient LLM infrastructure
Evaluate FullFlat networks for high-bandwidth, low-latency connectivity
Optimize system design to improve Model FLOPS Utilization (MFU)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive co-design framework for LLMs
FullFlat network architectures for high performance
Performance modeling tool validated to within 10% of measured LLM runtimes