🤖 AI Summary
This study addresses the declining scaling efficiency of GPU clusters in large-scale AI model training. Using the MLPerf Training v4.1 benchmark, it systematically analyzes the trade-offs among performance, GPU utilization, and energy efficiency across four representative workloads: BERT, Llama2 with LoRA, RetinaNet, and Stable Diffusion. By identifying the inflection points at which training-time scalability degrades, the work proposes workload-aware hardware configuration strategies tailored to each model's characteristics, and it uncovers a nonlinear decay relationship between cluster scale and energy efficiency. It further demonstrates that multi-level parallelism optimization improves average effective GPU utilization by 23.6%, sustains scaling efficiency above 92%, and reduces energy consumption per sample by 18.4%, thereby enabling joint optimization of performance, resource efficiency, and energy efficiency.
📝 Abstract
Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training, this approach has a negative impact on efficiency. In this article, we present a detailed analysis of the training times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that certain configurations optimise the trade-off between performance, GPU usage, and efficiency. The results point to a break-even point at which training times can be reduced while maximising efficiency.
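The speedup and scaling-efficiency notions underlying this analysis can be sketched with their standard definitions. This is a minimal illustration with hypothetical GPU counts and training times; the numbers are not taken from the MLPerf Training v4.1 results.

```python
def speedup(t_base: float, t_n: float) -> float:
    """Speedup of a larger run relative to the baseline run."""
    return t_base / t_n

def scaling_efficiency(t_base: float, t_n: float, n_base: int, n: int) -> float:
    """Achieved speedup divided by the ideal speedup (n / n_base)."""
    return speedup(t_base, t_n) / (n / n_base)

# Hypothetical training times (minutes) at increasing GPU counts.
times = {8: 120.0, 16: 62.0, 32: 34.0, 64: 21.0}
base_n = 8
for n, t in times.items():
    eff = scaling_efficiency(times[base_n], t, base_n, n)
    print(f"{n:3d} GPUs: speedup {speedup(times[base_n], t):5.2f}, "
          f"efficiency {eff:6.1%}")
```

With timings like these, efficiency falls as the cluster grows (here from 100% at 8 GPUs to roughly 71% at 64 GPUs), which is the kind of break-even behaviour the article characterises per workload.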