🤖 AI Summary
This study addresses the declining scaling efficiency of GPU clusters in large-scale AI model training. Using the MLPerf Training v4.1 benchmark, it systematically analyzes the trade-offs among performance, GPU utilization, and energy efficiency across four representative workloads: BERT, Llama2 with LoRA, RetinaNet, and Stable Diffusion. By identifying the inflection points at which training-time scalability degrades, the work proposes workload-aware hardware configuration strategies tailored to each model's characteristics, and it uncovers a nonlinear decay relationship between cluster scale and energy efficiency. It further demonstrates that multi-level parallelism optimization improves average effective GPU utilization by 23.6%, sustains scaling efficiency above 92%, and reduces energy consumption per sample by 18.4%, thereby enabling joint optimization of performance, resource efficiency, and energy efficiency.
📝 Abstract
Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training, this approach has a negative impact on efficiency. In this article, we present a detailed analysis of the training times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that certain configurations optimise the trade-off between performance, GPU usage, and efficiency. The results point to a break-even point at which training times can be reduced while maximising efficiency.
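The speedup and scaling-efficiency notions underlying this analysis can be sketched with their standard definitions. This is a minimal illustration with hypothetical GPU counts and training times; the numbers are not taken from the MLPerf Training v4.1 results.

```python
def speedup(t_base: float, t_n: float) -> float:
    """Speedup of a larger run relative to the baseline run."""
    return t_base / t_n

def scaling_efficiency(t_base: float, t_n: float, n_base: int, n: int) -> float:
    """Achieved speedup divided by the ideal speedup (n / n_base)."""
    return speedup(t_base, t_n) / (n / n_base)

# Hypothetical training times (minutes) at increasing GPU counts.
times = {8: 120.0, 16: 62.0, 32: 34.0, 64: 21.0}
base_n = 8
for n, t in times.items():
    eff = scaling_efficiency(times[base_n], t, base_n, n)
    print(f"{n:3d} GPUs: speedup {speedup(times[base_n], t):5.2f}, "
          f"efficiency {eff:6.1%}")
```

With timings like these, efficiency falls as the cluster grows (here from 100% at 8 GPUs to roughly 71% at 64 GPUs), which is the kind of break-even behaviour the article characterises per workload.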