AI Summary
This work addresses the challenge of efficiently pretraining small language models under constrained computational and data resources. Leveraging the FABRIC platform, the study systematically evaluates data parallelism, intra-operator parallelism, pipeline parallelism, and their combinations across both homogeneous and heterogeneous GPU clusters, while quantifying the impact of network latency on geographically distributed training. Building upon the Alpa and Ray frameworks and using GPT-2 medium and large models, the authors propose a hardware- and network-aware parallelism strategy selection method. Experimental results demonstrate that Alpa's joint optimization of intra-operator and pipeline parallelism achieves superior performance in high-latency environments (tens of milliseconds), significantly reducing the number of required GPUs and enhancing training efficiency.
Abstract
Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When only limited datasets are available, smaller LLMs are a better choice for pretraining (on user-specified datasets), following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored in vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, inter-operator/pipeline parallelism, and their combinations for pretraining. We set up GPU clusters with both homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining performance, especially when GPUs are geographically distributed. We pretrained GPT-2 medium and large models using open-source packages, namely Alpa and Ray. We observed that Alpa's execution plans, which jointly optimized intra-operator and inter-operator/pipeline parallelism, consistently performed best when GPUs were geographically distributed, especially when network latencies were in the tens of milliseconds. Based on the insights gained from these experiments, we propose a systematic approach for selecting the appropriate pretraining technique to achieve high training performance (lower execution time) while reducing the number of GPUs used.