🤖 AI Summary
Predicting end-to-end training time for distributed LLMs is highly challenging due to tight coupling among Transformer components, multi-dimensional parallelism (data, model, pipeline, and tensor), and complex hierarchical communication interactions. This paper proposes a performance modeling approach that combines operator-level decomposition with hardware-aware lightweight sampling, enabling, for the first time, CPU-only, high-accuracy training-time prediction across heterogeneous platforms (e.g., A100, GH200) and supporting all major parallelism strategies. The method features fine-grained operator modeling, multi-level communication modeling, and an end-to-end integrated framework. Evaluated on the Perlmutter and Vista clusters for 20B-parameter models trained across 128 GPUs, it achieves mean absolute percentage errors of 4.98% and 9.38%, respectively. This significantly reduces empirical tuning overhead and accelerates co-design iterations of hardware architectures and distributed training strategies.
📝 Abstract
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight, sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors (4.98% on Perlmutter (A100) and 9.38% on Vista (GH200)) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
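The core idea described above can be illustrated with a minimal sketch: decompose a workload into operators with known FLOP counts, fit an effective throughput from a handful of timed samples (the "lightweight sampling"), sum per-operator times for an end-to-end prediction, and score it with the mean absolute percentage error (MAPE) metric the paper reports. All names, operator counts, and constants here are hypothetical placeholders, not the paper's actual model.

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One computational primitive with its FLOP count (illustrative)."""
    name: str
    flops: float

def fit_effective_flops_per_s(sampled_flops, sampled_times_s):
    """Fit time ~= flops / throughput via least squares through the origin,
    using a few sampled (flops, measured time) pairs from the target GPU."""
    num = sum(f * t for f, t in zip(sampled_flops, sampled_times_s))
    den = sum(t * t for t in sampled_times_s)
    return num / den

def predict_step_time(ops, flops_per_s):
    """Operator-level decomposition: sum predicted per-operator times."""
    return sum(op.flops / flops_per_s for op in ops)

def mape(predicted, measured):
    """Mean absolute percentage error, as reported in the evaluation."""
    return 100.0 * sum(abs(p - m) / m
                       for p, m in zip(predicted, measured)) / len(predicted)

# Hypothetical usage: fit throughput from three samples, then predict
# a step built from three illustrative transformer operators.
throughput = fit_effective_flops_per_s(
    [1e12, 2e12, 4e12], [0.01, 0.02, 0.04])   # -> 1e14 flops/s
layer_ops = [Op("qkv_proj", 3e12), Op("attention", 1e12), Op("mlp", 8e12)]
t_pred = predict_step_time(layer_ops, throughput)  # -> 0.12 s
```

A real system would fit separate models per operator class (GEMM, attention, collectives) and add the multi-tier communication terms; this sketch only shows the compute-side skeleton of the approach.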