Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale data-parallel training, frequent global communication between workers limits scalability. To address this, the authors propose Pseudo-Asynchronous Local SGD (PALSGD), an extension of Local SGD and DiLoCo that introduces a pseudo-synchronization mechanism, allowing longer synchronization intervals while maintaining model consistency. The paper also provides a theoretical analysis establishing convergence and deriving an explicit convergence rate. Experiments on image classification (ImageNet-1K with ResNet-50) and language modeling (TinyStories with GPT-Neo) show that PALSGD trains up to 24.4% faster than DDP with no loss in final performance.

📝 Abstract
Following AI scaling trends, frontier models continue to grow in size and to be trained on ever-larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven the development of distributed deep learning methods. Data parallelism is an essential approach to speed up training, but it requires frequent global communication between workers, which can bottleneck training at the largest scales. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is an extension of Local SGD (Stich, 2018) and DiLoCo (Douillard et al., 2023), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals compared to standard Local SGD. Despite the reduced communication frequency, the pseudo-synchronization approach ensures that model consistency is maintained, leading to performance comparable to that achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluated PALSGD on image classification and language modeling tasks. Our results show that PALSGD achieves better performance in less time compared to existing methods like Distributed Data Parallel (DDP) and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M.
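To make the communication pattern concrete, here is a minimal single-process simulation of the Local SGD skeleton that PALSGD extends: each worker takes local gradient steps and workers only average globally every `sync_every` steps. The "soft pull" toward the worker's last view of the global model stands in for the pseudo-synchronization idea; the quadratic per-worker objectives, the `pull` rate, and all parameter values are illustrative assumptions, not the paper's actual update rule.

```python
import numpy as np

def local_sgd(num_workers=4, dim=8, steps=200, sync_every=20,
              lr=0.1, pull=0.05, seed=0):
    """Toy Local SGD loop with infrequent global averaging.

    Each worker k minimizes its own quadratic f_k(x) = ||x - c_k||^2,
    so the global optimum is the mean of the centers c_k.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(num_workers, dim))   # per-worker data "mean"
    global_model = np.zeros(dim)
    workers = [global_model.copy() for _ in range(num_workers)]

    for t in range(1, steps + 1):
        for k in range(num_workers):
            grad = 2.0 * (workers[k] - centers[k])  # exact gradient of f_k
            workers[k] -= lr * grad
            # Illustrative stand-in for pseudo-synchronization: between
            # global syncs, softly pull each worker toward its last view of
            # the global model (this rule is an assumption for the sketch).
            workers[k] -= pull * (workers[k] - global_model)
        if t % sync_every == 0:
            # Infrequent global communication: average and redistribute.
            global_model = np.mean(workers, axis=0)
            workers = [global_model.copy() for _ in range(num_workers)]
    return global_model, centers
```

In a real data-parallel run the averaging step would be an all-reduce over workers, and the point of longer `sync_every` intervals is that this collective is the expensive, bandwidth-bound operation being amortized.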
Problem

Research questions and friction points this paper is trying to address.

Reducing communication bottlenecks in large-scale distributed deep learning
Maintaining model consistency with less frequent synchronization
Improving training efficiency compared to existing data-parallel methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo-synchronization mechanism reduces communication frequency
Maintains model consistency with longer synchronization intervals
Achieves faster training compared to existing methods
Hiroki Naganuma
Mila, Université de Montréal
Xinzhi Zhang
University of Washington
P. Witte
Microsoft
Russell J. Hewett
Microsoft
Man-Chung Yue
Assistant Professor, The University of Hong Kong (Optimization, Data Science, Operations Research, Signal Processing)
Ioannis Mitliagkas
Mila, Canada CIFAR AI Chair