Simulating LLM training workloads for heterogeneous compute and network infrastructure

📅 2025-08-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing LLM training simulators commonly assume hardware homogeneity, failing to capture real-world performance deviations arising from resource sharing, device-generation heterogeneity, and on-chip interconnect diversity in cloud environments. To address this, we propose the first distributed LLM training simulator supporting fine-grained device heterogeneity. Our approach systematically models non-uniform computation and communication capabilities, enabling customizable hardware topologies and parallelism strategy abstractions. We further design a non-uniform workload partitioning mechanism that integrates computation-communication overlap modeling, fine-grained pipeline scheduling, and heterogeneous resource mapping. Experimental evaluation demonstrates that our simulator faithfully reproduces actual training-time trends, achieving significantly higher fidelity than homogeneous baselines. This enables reliable performance assessment for training optimization and architecture co-design in heterogeneous systems.
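To make the partitioning idea concrete: assign each pipeline stage a share of layers proportional to its device's throughput, then estimate per-stage time with an overlap model in which communication can hide behind computation. Below is a minimal Python sketch under assumed numbers (peak TFLOPS, per-layer FLOPs, message sizes); the function names and constants are illustrative, not taken from the paper.

```python
# Minimal sketch of non-uniform workload partitioning with a simple
# computation-communication overlap model. All constants are assumptions.

def partition_layers(num_layers, tflops_per_device):
    """Split layers across devices in proportion to compute capability."""
    total = sum(tflops_per_device)
    shares = [num_layers * t / total for t in tflops_per_device]
    counts = [int(s) for s in shares]
    # Hand out leftover layers to the largest fractional shares.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

def stage_time(layers, tflops, flops_per_layer, comm_bytes, bw_bytes_per_s, overlap=True):
    """Estimate one stage's step time; with overlap, comm hides behind compute."""
    compute = layers * flops_per_layer / (tflops * 1e12)
    comm = comm_bytes / bw_bytes_per_s
    return max(compute, comm) if overlap else compute + comm

if __name__ == "__main__":
    devices = [312.0, 312.0, 125.0, 125.0]  # e.g. two newer and two older GPUs (peak TFLOPS)
    layers = partition_layers(48, devices)
    print("layers per stage:", layers)       # faster devices receive more layers
    times = [stage_time(n, t, 2e12, 5e8, 25e9) for n, t in zip(layers, devices)]
    print("per-stage time (s):", [round(x, 3) for x in times])
    print("pipeline bottleneck (s):", round(max(times), 3))
```

With these numbers the split comes out as [17, 17, 7, 7] and the per-stage times land within a few percent of each other, which is the balancing effect non-uniform partitioning aims for; a uniform 12-layer split would leave the slower devices as a 0.192 s bottleneck.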

📝 Abstract
The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge, LLM training simulators are employed to estimate training time and guide design decisions. However, state-of-the-art LLM training simulators assume homogeneous compute and network infrastructure. In practice, device heterogeneity is inevitable due to resource sharing in cloud environments, frequent shifts in device generations, and inherent intra-chip interconnect heterogeneity. To close the gap between the state of the art and these practical requirements, we propose the design of a heterogeneity-aware distributed LLM training simulator capable of predicting training time, with abstractions for specifying custom configurations for device groups and device-to-parallelism mappings. We present the design requirements and challenges in building a heterogeneity-aware distributed ML training simulator, and design components such as non-uniform workload partitioning. Our initial simulation results demonstrate the impact of heterogeneity on model computation and communication time.
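The "custom configurations for device groups and device-to-parallelism mapping" mentioned above suggest a small configuration layer. The sketch below shows one plausible shape for such abstractions; the dataclass names and fields are assumptions for illustration, not the authors' actual interface.

```python
# Hypothetical device-group and device-to-parallelism-mapping abstractions.
from dataclasses import dataclass

@dataclass
class DeviceGroup:
    name: str
    num_devices: int
    peak_tflops: float     # per-device compute capability
    intra_bw_gbps: float   # intra-node / on-chip interconnect bandwidth
    inter_bw_gbps: float   # network bandwidth toward other groups

@dataclass
class ParallelismMapping:
    tensor_parallel: list      # device groups backing tensor-parallel shards
    pipeline_stages: list      # one device group per pipeline stage
    data_parallel_degree: int = 1

# Two device generations sharing one cluster.
cluster = [
    DeviceGroup("gen_new", num_devices=8, peak_tflops=312.0,
                intra_bw_gbps=600.0, inter_bw_gbps=100.0),
    DeviceGroup("gen_old", num_devices=8, peak_tflops=125.0,
                intra_bw_gbps=300.0, inter_bw_gbps=100.0),
]

# Keep bandwidth-hungry tensor parallelism on the well-connected group,
# and mix generations across pipeline stages.
mapping = ParallelismMapping(
    tensor_parallel=[cluster[0]],
    pipeline_stages=[cluster[0], cluster[1]],
    data_parallel_degree=2,
)
```

A front end along these lines would let the same workload be replayed against different group definitions and mappings, which is what the abstract's custom-configuration abstractions enable.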
Problem

Research questions and friction points this paper is trying to address.

Simulating LLM training for heterogeneous compute and network
Addressing device heterogeneity in distributed model training
Predicting training time with custom configurations and mappings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneity-aware distributed LLM simulator (see the timing sketch after this list)
Non-uniform workload partitioning design
Custom device group configurations
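To see why such a simulator matters, consider a toy pipeline-timing model: a microbatch enters stage s once that stage is free and stage s-1 has finished it. The event loop below is illustrative only; the paper's scheduler is finer-grained and also models backward passes and overlap, and the stage times and microbatch count here are invented.

```python
# Toy forward-only pipeline timing over heterogeneous stages (illustrative).

def pipeline_time(stage_times, num_microbatches):
    """Return when the last microbatch leaves the last stage."""
    free_at = [0.0] * len(stage_times)  # time each stage next becomes free
    finish = 0.0
    for _ in range(num_microbatches):
        upstream = 0.0  # when this microbatch arrives from the previous stage
        for s, t in enumerate(stage_times):
            start = max(free_at[s], upstream)
            upstream = free_at[s] = start + t
        finish = upstream
    return finish

homogeneous = pipeline_time([0.110] * 4, num_microbatches=16)    # ~2.09 s
heterogeneous = pipeline_time([0.080, 0.080, 0.140, 0.140], 16)  # ~2.54 s
print(f"homogeneous:   {homogeneous:.2f} s")
print(f"heterogeneous: {heterogeneous:.2f} s")
```

Both clusters do the same per-microbatch work (0.44 s summed over stages), yet the heterogeneous pipeline finishes about 20% later because the slowest stage sets the steady-state rate; this is the kind of effect a homogeneous simulator cannot predict and a non-uniform partitioner is meant to counteract.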
Sumit Kumar
IIIT-Delhi, India
Arjun Temura
IIIT-Delhi, India
Naman Sharma
IIIT-Delhi, India
Ramanjeet Singh
IIIT-Delhi, India
Meet Dadhania
IIT Hyderabad, India
Praveen Tammana
Assistant Professor, IIT Hyderabad
Computer Systems and Networking · Software-Defined Networking · P4
Satananda Burla
Marvell Technology Inc., USA
Abed Mohammad Kamaluddin
Marvell Technology Inc., India
Rinku Shah
Assistant Professor, CSE Department, IIIT Delhi
Networked systems · Software-defined networking · Programmable networks