Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Predicting end-to-end training time for distributed LLMs is highly challenging due to tight coupling among Transformer components, multi-dimensional parallelism (data, model, pipeline, and tensor), and complex hierarchical communication interactions. This paper proposes a performance modeling approach that combines operator-level decomposition with hardware-aware lightweight sampling—enabling, for the first time, CPU-only, high-accuracy training-time prediction across heterogeneous platforms (e.g., A100, GH200) and supporting all major parallelism strategies. Our method features fine-grained operator modeling, multi-level communication modeling, and an end-to-end integrated framework. Evaluated on the Perlmutter and Vista clusters for 20B-parameter models trained across 128 GPUs, it achieves mean absolute percentage errors of 4.98% and 9.38%, respectively. This significantly reduces empirical tuning overhead and accelerates co-design iterations of hardware architectures and distributed training strategies.
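The reported 4.98% and 9.38% figures are mean absolute percentage errors (MAPE). A minimal illustration of the metric, using made-up predicted vs. measured per-iteration times (the numbers are hypothetical, not from the paper):

```python
def mape(predicted, measured):
    """Mean absolute percentage error, the metric behind the reported
    4.98% (Perlmutter) and 9.38% (Vista) results."""
    return 100 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical predicted vs. measured per-iteration training times (seconds).
print(round(mape([10.2, 19.5, 41.0], [10.0, 20.0, 40.0]), 2))  # → 2.33
```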

📝 Abstract
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between Transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors of 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
Problem

Research questions and friction points this paper is trying to address.

Predicting distributed LLM training time across multiple GPUs
Modeling complex interactions between transformer components and parallelism
Addressing limitations of analytical models with real-world hardware complexities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Operator-level decomposition for fine-grained analysis
Lightweight sampling for hardware-aware prediction models
End-to-end prediction system for complex parallelization strategies
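Taken together, the three contributions above can be illustrated with a toy sketch. Everything below is an illustrative assumption rather than the paper's implementation: the linear FLOPs-to-latency cost model, the sample measurements, and the function names are all hypothetical. The idea is to fit a per-operator latency model from a handful of hardware samples (lightweight sampling), decompose one Transformer layer into its dominant GEMMs (operator-level decomposition), and sum the predicted operator times.

```python
import numpy as np

def fit_operator_model(flops_samples, measured_ms):
    """Fit a simple linear latency model t = a * flops + b from a few
    hardware measurements (a stand-in for lightweight sampling)."""
    a, b = np.polyfit(flops_samples, measured_ms, deg=1)
    return lambda flops: a * flops + b

# Hypothetical sampled (FLOPs, latency-in-ms) pairs for a GEMM operator.
gemm_model = fit_operator_model(
    np.array([1e9, 4e9, 16e9]), np.array([0.10, 0.35, 1.30])
)

def predict_layer_ms(hidden, seq, batch, op_model):
    """Operator-level decomposition of one Transformer layer into its
    dominant GEMMs: QKV projection, attention output projection, and
    the two MLP matrix multiplies (4x expansion assumed)."""
    gemm_flops = [
        2 * batch * seq * hidden * 3 * hidden,  # fused QKV projection
        2 * batch * seq * hidden * hidden,      # attention output projection
        2 * batch * seq * hidden * 4 * hidden,  # MLP up-projection
        2 * batch * seq * 4 * hidden * hidden,  # MLP down-projection
    ]
    return sum(op_model(f) for f in gemm_flops)

print(round(predict_layer_ms(4096, 2048, 1, gemm_model), 2))
```

A full predictor along these lines would also model attention score/value GEMMs, memory-bound operators, and the communication collectives implied by each parallelism strategy, then compose per-layer times across pipeline stages; this sketch only shows the compute-side decomposition.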
Biyao Zhang
Aerospace Information Research Institute, Chinese Academy of Sciences
Remote Sensing of Environment
Mingkai Zheng
Rutgers University, Brunswick, NJ, USA
Debargha Ganguly
Case Western Reserve University, Cleveland, OH, USA
Xuecen Zhang
Case Western Reserve University, Cleveland, OH, USA
Vikash Singh
Case Western Reserve University, Cleveland, OH, USA
Vipin Chaudhary
Case Western Reserve University
High Performance Computing · Artificial Intelligence · Data Science · Computer Vision · Quantum Computing
Zhao Zhang
Rutgers University, Brunswick, NJ, USA