AI Summary
This work addresses the challenge of efficiently pretraining small language models under constrained computational and data resources. Leveraging the FABRIC platform, the study systematically evaluates data parallelism, intra-operator parallelism, pipeline parallelism, and their combinations across both homogeneous and heterogeneous GPU clusters, while quantifying the impact of network latency on geographically distributed training. Building upon the Alpa and Ray frameworks and using GPT-2 medium and large models, the authors propose a hardware- and network-aware parallelism strategy selection method. Experimental results demonstrate that Alpa's joint optimization of intra-operator and pipeline parallelism achieves superior performance in high-latency environments (tens of milliseconds), significantly reducing the number of required GPUs and enhancing training efficiency.
Abstract
Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When only limited datasets are available, smaller LLMs are a better choice for pretraining (on user-specified datasets), following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored in vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, inter-operator/pipeline parallelism, and their combinations for pretraining. We set up GPU clusters with both homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining performance, especially when GPUs are geographically distributed. We pretrained GPT-2 medium and large models using open-source packages, namely Alpa and Ray. We observed that Alpa's execution plans, which jointly optimized intra-operator and inter-operator/pipeline parallelism, consistently performed best when GPUs were geographically distributed, especially when network latencies were in the tens of milliseconds. Based on the insights gained from these experiments, we propose a systematic approach for selecting the appropriate pretraining technique to achieve high training performance (lower execution time) while reducing the number of GPUs used.