Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
During large-scale pre-training of LLMs on GPU clusters with thousands of devices, severe bandwidth contention arises when bursty, high-bandwidth inter-GPU communications within communication groups misalign with the underlying physical network topology, significantly hindering training efficiency. This work presents the first systematic analysis revealing the critical impact of network topology on LLM training performance. We propose a topology-aware resource scheduling system that models communication patterns via deep feature analysis and designs a scheduling algorithm to minimize cross-topology dispersion of communication groups. Evaluated through large-scale simulations and real-world deployment, our approach achieves a 10.6% end-to-end training speedup in a production environment with over 9,600 GPUs. It substantially improves cross-node communication efficiency and cluster-wide resource utilization.

📝 Abstract
The scaling law for large language models (LLMs) indicates that the path towards machine intelligence necessitates training at large scale. Companies therefore continuously build large-scale GPU clusters and launch training jobs that span thousands of computing nodes. However, LLM pre-training presents unique challenges due to its complex communication patterns, where GPUs exchange data in sparse yet high-volume bursts within specific groups. Inefficient resource scheduling exacerbates bandwidth contention, leading to suboptimal training performance. This paper presents Arnold, a scheduling system that distills our experience in aligning LLM communication patterns with data center topology at scale. We perform an in-depth characterization study to identify the impact of physical network topology on LLM pre-training jobs. Based on these insights, we develop a scheduling algorithm that effectively aligns communication patterns with the physical network topology of modern data centers. Through simulation experiments, we show that our algorithm reduces the maximum spread of communication groups by up to 1.67×. In production training, our scheduling system improves end-to-end performance by 10.6% when training with more than 9,600 GPUs, a significant improvement for our training pipeline.
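To make "reducing the maximum spread of communication groups" concrete, here is a minimal, hypothetical sketch of topology-aware placement (not the paper's actual Arnold algorithm, whose details are not given here): greedily pack each communication group onto free nodes under as few leaf switches as possible, so that each group's spread (the number of switches it touches) stays small.

```python
def place_groups(groups, topology):
    """Greedy topology-aware placement sketch (illustrative only).

    groups:   {group_id: number of nodes the group needs}
    topology: {switch_id: number of free nodes under that switch}
    Returns (placement, max_spread), where placement maps each group to
    {switch_id: nodes taken} and max_spread is the largest number of
    switches any single group was scattered across.
    """
    placement = {}
    max_spread = 0
    # Place the largest groups first so they can claim contiguous capacity.
    for gid, need in sorted(groups.items(), key=lambda kv: -kv[1]):
        taken = {}
        # Prefer the switches with the most free nodes to keep the group compact.
        for sw in sorted(topology, key=lambda s: -topology[s]):
            if need == 0:
                break
            grab = min(need, topology[sw])
            if grab > 0:
                taken[sw] = grab
                topology[sw] -= grab
                need -= grab
        if need > 0:
            raise ValueError(f"not enough free nodes for group {gid}")
        placement[gid] = taken
        max_spread = max(max_spread, len(taken))
    return placement, max_spread
```

For example, placing groups {"dp0": 4, "dp1": 3} on switches {"sw0": 4, "sw1": 2, "sw2": 2} keeps dp0 entirely under sw0 (spread 1), while dp1 must span sw1 and sw2 (spread 2). A real scheduler would additionally weigh per-group traffic volume and multi-tier topologies, which this sketch ignores.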
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM pre-training communication on large GPU clusters
Aligning sparse high-volume bursts with network topology
Reducing bandwidth contention for efficient resource scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topology-aware scheduling system for LLMs
Aligns communication patterns with network topology
Reduces bandwidth contention in GPU clusters
Authors:
Guoliang He, University of Cambridge
Youhe Jiang, University of Cambridge
Wencong Xiao, ByteDance (distributed systems, machine learning systems, resource management)
Kaihua Jiang, ByteDance Seed
Shuguang Wang, ByteDance Seed
Jun Wang, ByteDance Seed
Zixian Du, ByteDance Seed
Zhuo Jiang, ByteDance Seed
Xinlei Zhang, ByteDance Seed
Binhang Yuan, HKUST
Eiko Yoneki, Computer Laboratory, University of Cambridge (optimisation, large-scale graph processing, distributed systems)