🤖 AI Summary
This study addresses the scalability challenges of training large language models for the electrocardiogram (ECG) modality (ECG-LLMs) on multi-GPU high-performance computing (HPC) systems. We systematically evaluate distributed-training scalability using SLURM job scheduling, Apptainer containers, and mainstream frameworks, including PyTorch, Horovod, and DeepSpeed, through cross-framework, multi-configuration empirical benchmarks on a production HPC platform. To our knowledge, this is the first HPC-scale scalability benchmark targeting ECG-LLMs specifically. Experimental results show a sub-linear 1.9× speedup when scaling to four GPUs, quantitatively identifying communication overhead and I/O bottlenecks as the critical, domain-specific efficiency constraints of medical time-series data. The study delivers a reproducible scalability benchmark and actionable optimization guidance for efficient training of healthcare AI models on supercomputing infrastructure.
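The workflow combines SLURM job scheduling with Apptainer containers to launch distributed PyTorch training. A minimal batch-script sketch of that pattern is below; the job name, time limit, container image, and training script are placeholders, not details from the study.

```shell
#!/bin/bash
# Hypothetical SLURM batch script illustrating the described setup:
# one node, four GPUs, PyTorch DDP launched inside an Apptainer container.
#SBATCH --job-name=ecg-llm-train
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00

# --nv exposes the host NVIDIA driver and CUDA devices to the container;
# torchrun spawns one worker process per GPU on this node.
apptainer exec --nv pytorch_cuda.sif \
    torchrun --standalone --nproc_per_node=4 train_ecg_llm.py
```

Submitting with `sbatch` and varying `--gres=gpu:N` (with a matching `--nproc_per_node`) reproduces the kind of 1-, 2-, and 4-GPU scaling comparison the study reports.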
📝 Abstract
Training large language models demands extensive computation, made feasible by high-performance computing (HPC) resources. This study compares multi-node and multi-GPU environments for training large language models on electrocardiogram data. It provides a detailed mapping of current frameworks for distributed deep learning in multi-node, multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in distributed capabilities of PyTorch and TensorFlow. We compare multi-GPU setups across dataset configurations, using multiple HPC nodes independently and focusing on scalability, speedup, efficiency, and overhead. The analysis leverages HPC infrastructure with SLURM, Apptainer (Singularity) containers, CUDA, PyTorch, and shell scripts to automate the training workflow. We achieved sub-linear speedup when scaling the number of GPUs: 1.6× with two GPUs and 1.9× with four.
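The reported speedups translate directly into parallel efficiency (achieved speedup divided by the ideal linear speedup for that GPU count). A short sketch computing this from the abstract's numbers:

```python
# Parallel efficiency from the reported sub-linear speedups:
# efficiency = speedup / number_of_GPUs (ideal linear scaling gives 1.0).

def parallel_efficiency(speedup: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_gpus

# Speedups reported in the abstract: 1.6x on 2 GPUs, 1.9x on 4 GPUs.
for n_gpus, speedup in {2: 1.6, 4: 1.9}.items():
    eff = parallel_efficiency(speedup, n_gpus)
    print(f"{n_gpus} GPUs: {speedup}x speedup -> {eff:.1%} efficiency")
# -> 80.0% at 2 GPUs, 47.5% at 4 GPUs
```

The drop from 80% to 47.5% efficiency as GPUs double is what quantifies the communication and I/O overhead the study identifies.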