🤖 AI Summary
This study addresses the scalability challenges of training large language models for the electrocardiogram (ECG) modality (ECG-LLMs) on multi-GPU high-performance computing (HPC) systems. We systematically evaluate distributed-training scalability using SLURM job scheduling, Apptainer containers, and mainstream frameworks, including PyTorch, Horovod, and DeepSpeed, through cross-framework, multi-configuration empirical benchmarks on a production HPC platform. To our knowledge, this is the first HPC-scale scalability benchmark targeting ECG-LLMs specifically. Experimental results show a sub-linear 1.9× speedup when scaling to four GPUs, quantitatively identifying communication overhead and I/O bottlenecks as the critical, domain-specific efficiency constraints of medical time-series data. The study delivers a reproducible scalability benchmark and actionable optimization guidance for efficient training of healthcare AI models on supercomputing infrastructure.
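The workflow combines SLURM job scheduling with Apptainer containers to launch distributed PyTorch training. A minimal batch-script sketch of that pattern is below; the job name, time limit, container image, and training script are placeholders, not details from the study.

```shell
#!/bin/bash
# Hypothetical SLURM batch script illustrating the described setup:
# one node, four GPUs, PyTorch DDP launched inside an Apptainer container.
#SBATCH --job-name=ecg-llm-train
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00

# --nv exposes the host NVIDIA driver and CUDA devices to the container;
# torchrun spawns one worker process per GPU on this node.
apptainer exec --nv pytorch_cuda.sif \
    torchrun --standalone --nproc_per_node=4 train_ecg_llm.py
```

Submitting with `sbatch` and varying `--gres=gpu:N` (with a matching `--nproc_per_node`) reproduces the kind of 1-, 2-, and 4-GPU scaling comparison the study reports.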
📝 Abstract
Training large language models demands extensive computation, made feasible by high-performance computing (HPC) resources. This study compares multi-node and multi-GPU environments for training large language models on electrocardiogram data. It provides a detailed mapping of current frameworks for distributed deep learning in multi-node, multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in distributed capabilities of PyTorch and TensorFlow. We compare multi-GPU setups across dataset configurations, using multiple HPC nodes independently and focusing on scalability, speedup, efficiency, and overhead. The analysis leverages HPC infrastructure with SLURM, Apptainer (Singularity) containers, CUDA, PyTorch, and shell scripts to automate the training workflow. We achieved sub-linear speedup when scaling the number of GPUs: 1.6× with two GPUs and 1.9× with four.
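The reported speedups translate directly into parallel efficiency (achieved speedup divided by the ideal linear speedup for that GPU count). A short sketch computing this from the abstract's numbers:

```python
# Parallel efficiency from the reported sub-linear speedups:
# efficiency = speedup / number_of_GPUs (ideal linear scaling gives 1.0).

def parallel_efficiency(speedup: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_gpus

# Speedups reported in the abstract: 1.6x on 2 GPUs, 1.9x on 4 GPUs.
for n_gpus, speedup in {2: 1.6, 4: 1.9}.items():
    eff = parallel_efficiency(speedup, n_gpus)
    print(f"{n_gpus} GPUs: {speedup}x speedup -> {eff:.1%} efficiency")
# -> 80.0% at 2 GPUs, 47.5% at 4 GPUs
```

The drop from 80% to 47.5% efficiency as GPUs double is what quantifies the communication and I/O overhead the study identifies.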