Efficient allocation of image recognition and LLM tasks on multi-GPU system

📅 2025-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses resource contention and load imbalance arising from co-training computer vision (CV) and large language model (LLM) tasks on multi-GPU systems. We propose a dynamic GPU allocation method tailored to heterogeneous workloads, integrating a data-type-aware LLM fine-tuning strategy with a fine-grained performance profiling framework. The approach enables adaptive scheduling and communication optimization for CV and LLM tasks sharing a GPU cluster. Built upon PyTorch’s distributed training primitives, it achieves coordinated optimization of memory, computation, and inter-GPU communication across modalities on NVIDIA H100 hardware. Experimental evaluation demonstrates, relative to baseline methods, an average 32% reduction in iteration time, a 27% improvement in GPU memory utilization, and a 41% decrease in communication overhead. Validation across benchmarks—including ImageNet-1K classification and LLaMA-2 fine-tuning—confirms simultaneous gains in resource efficiency and model accuracy.

Technology Category

Application Category

📝 Abstract
This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning model in image recognition, various parallelization methods are developed based on different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. A detailed description of presented strategies is given, highlighting the challenges and benefits of their application. Furthermore, the impact of different dataset types on the tuning process of large language models is investigated. Experiments show to what extent the task type affects the iteration time in a multi-GPU environment, offering valuable insights into the optimal data utilization strategies to improve model performance. Furthermore, this study leverages the built-in parallelization mechanisms of PyTorch that can facilitate these tasks. Furthermore, performance profiling is incorporated into the study to thoroughly evaluate the impact of memory and communication operations during the training/tuning procedure. Test scenarios are developed and tested with numerous benchmarks on the NVIDIA H100 architecture showing efficiency through selected metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluates parallelization performance for image classification and LLMs.
Investigates dataset impact on LLM tuning in multi-GPU systems.
Optimizes data utilization strategies to enhance model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallelization methods for multi-GPU image recognition
Impact of dataset types on LLM tuning processes
Performance profiling with PyTorch on NVIDIA H100
M
Marcin Lawenda
Poznan Supercomputing and Networking Center, Jana Pawła II 10, 61-139 Poznań, Poland
K
Krzesimir Samborski
Poznan Supercomputing and Networking Center, Jana Pawła II 10, 61-139 Poznań, Poland
K
Kyrylo Khloponin
Poznan Supercomputing and Networking Center, Jana Pawła II 10, 61-139 Poznań, Poland
Lukasz Szustak
Lukasz Szustak
PhD, Assistant Professor, Czestochowa University of Technology