A Study of the Efficiency of GPU Scalability for Artificial Intelligence Training

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the declining scaling efficiency of GPU clusters in large-scale AI model training. Leveraging the MLPerf Training v4.1 benchmark, it systematically analyzes the trade-offs among performance, GPU utilization, and energy efficiency across four representative workloads: BERT, Llama2 with LoRA, RetinaNet, and Stable Diffusion. By identifying the inflection points where training-time scalability degrades, the work proposes workload-aware hardware configuration strategies tailored to each model's characteristics, and it uncovers a nonlinear decay relationship between cluster scale and energy efficiency. Furthermore, it demonstrates that multi-level parallelism optimization improves average effective GPU utilization by 23.6%, achieves >92% scaling efficiency, and reduces energy consumption per sample by 18.4%, enabling joint optimization of performance, resource efficiency, and energy efficiency.
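The two headline metrics in the summary, scaling efficiency and energy per sample, follow standard definitions. A minimal sketch of how they are computed, with made-up numbers (these are illustrative values, not results from the MLPerf Training v4.1 data analyzed in the paper):

```python
def scaling_efficiency(t_base, n_base, t_n, n_gpus):
    """Speedup over the base run, normalized by the GPU-count ratio.

    1.0 means perfectly linear scaling; values below 1.0 indicate the
    kind of degradation whose inflection points the paper locates.
    """
    speedup = t_base / t_n
    return speedup / (n_gpus / n_base)

def energy_per_sample(avg_power_watts, train_time_s, n_samples):
    """Joules consumed per training sample processed."""
    return avg_power_watts * train_time_s / n_samples

# Hypothetical example: 8 GPUs finish in 1000 s; 64 GPUs finish in 150 s.
# The 6.7x speedup on 8x the hardware yields ~83% scaling efficiency.
eff = scaling_efficiency(t_base=1000, n_base=8, t_n=150, n_gpus=64)
print(f"scaling efficiency: {eff:.1%}")  # -> scaling efficiency: 83.3%
```

The break-even analysis in the abstract amounts to sweeping `n_gpus` and finding where the marginal drop in this efficiency outweighs the marginal reduction in training time.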

📝 Abstract
Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In this article, we present a detailed analysis of the times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that there are configurations that optimise the relationship between performance, GPU usage, and efficiency. The results point to a break-even point that allows training times to be reduced while maximising efficiency.
Problem

Research questions and friction points this paper is trying to address.

Analyzing GPU scalability efficiency in AI training
Optimizing performance and GPU usage trade-offs
Identifying break-even point for training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU scalability analysis for AI training
MLPerf benchmark optimization configurations
Performance-efficiency trade-off break-even point
David Cortes
Dpto. de Ciencias Matemáticas e Informática, Universidad de las Islas Baleares
Carlos Juiz
Dpto. de Ciencias Matemáticas e Informática, Universidad de las Islas Baleares
Belen Bermejo
PhD in Computer Science, Universitat de les Illes Balears
Performance engineering · Virtualization · Cloud Computing · Energy efficiency