ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing simulators for distributed machine learning struggle to faithfully model latency-sensitive collective communications and fine-grained interactions between GPUs and infrastructure, limiting system-level design space exploration. This work proposes a high-fidelity simulation framework that enables accurate co-simulation of collective communication algorithms, network requirements, and GPU architectures through cache-line-granularity communication modeling, a fine-grained GPU execution model, and a standardized InfraGraph representation of infrastructure. By integrating these components, the framework significantly enhances simulation accuracy and generality, offering a powerful tool for efficiently exploring and optimizing distributed machine learning systems.
📝 Abstract
Distributed machine learning (ML) is a key paradigm for today's large-scale artificial intelligence applications. As model inference emerges as an important use case, faithful modeling of latency-sensitive collective communication has never been more important. Capturing the device architecture and modeling control and data paths at high fidelity is therefore a necessity. A common, detailed representation of distributed ML infrastructure is also crucial. We revisit ASTRA-sim, a promising open-source, community-driven simulator. In this work, we identify limitations of the current ASTRA-sim simulator and augment it with new features. To this end, we enable fine-grained, high-fidelity simulation with a standardized infrastructure representation, opening new design space exploration opportunities. We propose simulating at cache-line-sized load-store granularity, with a detailed graphics processing unit (GPU) execution model, to balance simulation scalability and fidelity. We also introduce InfraGraph, a standardized representation that captures distributed ML network infrastructure in detail. Using the updated ASTRA-sim 3.0 simulator, we showcase design space explorations for optimized collective algorithms, network requirements, and GPU architectures.
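To give a feel for why cache-line-sized load-store granularity is a trade-off between scalability and fidelity, the following is a minimal sketch that counts how many cache-line-sized operations a single GPU would issue during a collective. It is not ASTRA-sim code and does not reflect its API. The function names are hypothetical, the 128-byte cache line is an assumed value, and the collective is the textbook ring all-reduce (2(N-1) steps, each moving a chunk of size M/N), chosen only for illustration.

```python
def cache_line_ops(num_bytes: int, line_bytes: int = 128) -> int:
    """Count cache-line-sized load-store operations needed to move num_bytes.

    line_bytes=128 is an assumed cache-line size, not a value from the paper.
    """
    if num_bytes < 0 or line_bytes <= 0:
        raise ValueError("num_bytes must be >= 0 and line_bytes must be > 0")
    # Integer ceiling division: a partially filled line still costs one operation.
    return -(-num_bytes // line_bytes)


def ring_all_reduce_ops(message_bytes: int, num_gpus: int, line_bytes: int = 128) -> int:
    """Per-GPU operation count for a textbook ring all-reduce (hypothetical helper).

    The ring algorithm runs 2 * (num_gpus - 1) steps (reduce-scatter followed by
    all-gather), and each step transfers one chunk of message_bytes / num_gpus.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    chunk_bytes = -(-message_bytes // num_gpus)  # chunk size, rounded up
    steps = 2 * (num_gpus - 1)
    return steps * cache_line_ops(chunk_bytes, line_bytes)


if __name__ == "__main__":
    # A 1 MiB all-reduce across 8 GPUs with the assumed 128-byte line.
    print(ring_all_reduce_ops(1 << 20, 8))
```

Under these assumptions, even a modest message turns into thousands of simulated operations per GPU, which illustrates why the chosen granularity affects both how much detail a simulator can capture and how far it can scale.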
Problem

Research questions and friction points this paper is trying to address.

distributed machine learning
high-fidelity simulation
collective communication
GPU modeling
infrastructure representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

high-fidelity simulation
cache-line granularity
GPU execution model
InfraGraph
collective communication
William Won
School of Computer Science, Georgia Institute of Technology
Computer Science
Jinsun Yoo
Georgia Institute of Technology
Tuan Ta
AMD Research and Advanced Development
Computer Architecture
Moumita Dey
Researcher, AMD Research
Computer Architecture, Side-Channels, Hardware Security, Embedded Systems
Andy Balogh
Keysight
Pradosh Datta
Keysight
Furkan Eris
AMD Research and Advanced Development
Conor Green
AMD Research and Advanced Development, Purdue University
Winston Liu
Keysight
Changhai Man
Georgia Institute of Technology
Kingshuk Mandal
Keysight
Amos Rai
Keysight
Vinay Ramakrishnaiah
AMD Research and Advanced Development
Ruchi Shah
University of Houston
Numerical Methods, High Performance Computing, Parallel Processing, Big Data Analytics
David Sidler
Microsoft Corporation, Redmond WA
Harsh Sikhwal
Keysight
Hanjiang Wu
Georgia Institute of Technology
Tushar Krishna
Associate Professor, Georgia Tech
Computer Architecture, Interconnection Networks, Network-on-Chip, Deep Learning Accelerators
Bradford M. Beckmann
AMD Research and Advanced Development