🤖 AI Summary
Predicting end-to-end training time for distributed LLMs is highly challenging due to tight coupling among Transformer components, multi-dimensional parallelism (data, model, pipeline, and tensor), and complex hierarchical communication interactions. This paper proposes a performance modeling approach that combines operator-level decomposition with hardware-aware lightweight sampling, enabling, for the first time, CPU-only, high-accuracy training-time prediction across heterogeneous platforms (e.g., A100, GH200) and supporting all major parallelism strategies. The method features fine-grained operator modeling, multi-level communication modeling, and an end-to-end integrated framework. Evaluated on the Perlmutter and Vista clusters for 20B-parameter models trained across 128 GPUs, it achieves mean absolute percentage errors of 4.98% and 9.38%, respectively. This significantly reduces empirical tuning overhead and accelerates co-design iterations of hardware architectures and distributed training strategies.
📝 Abstract
Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight, sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors (4.98% on Perlmutter (A100) and 9.38% on Vista (GH200)) for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
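The core idea described above can be illustrated with a minimal sketch: decompose a workload into operators with known FLOP counts, fit an effective throughput from a handful of timed samples (the "lightweight sampling"), sum per-operator times for an end-to-end prediction, and score it with the mean absolute percentage error (MAPE) metric the paper reports. All names, operator counts, and constants here are hypothetical placeholders, not the paper's actual model.

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One computational primitive with its FLOP count (illustrative)."""
    name: str
    flops: float

def fit_effective_flops_per_s(sampled_flops, sampled_times_s):
    """Fit time ~= flops / throughput via least squares through the origin,
    using a few sampled (flops, measured time) pairs from the target GPU."""
    num = sum(f * t for f, t in zip(sampled_flops, sampled_times_s))
    den = sum(t * t for t in sampled_times_s)
    return num / den

def predict_step_time(ops, flops_per_s):
    """Operator-level decomposition: sum predicted per-operator times."""
    return sum(op.flops / flops_per_s for op in ops)

def mape(predicted, measured):
    """Mean absolute percentage error, as reported in the evaluation."""
    return 100.0 * sum(abs(p - m) / m
                       for p, m in zip(predicted, measured)) / len(predicted)

# Hypothetical usage: fit throughput from three samples, then predict
# a step built from three illustrative transformer operators.
throughput = fit_effective_flops_per_s(
    [1e12, 2e12, 4e12], [0.01, 0.02, 0.04])   # -> 1e14 flops/s
layer_ops = [Op("qkv_proj", 3e12), Op("attention", 1e12), Op("mlp", 8e12)]
t_pred = predict_step_time(layer_ops, throughput)  # -> 0.12 s
```

A real system would fit separate models per operator class (GEMM, attention, collectives) and add the multi-tier communication terms; this sketch only shows the compute-side skeleton of the approach.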