Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high communication and synchronization overheads and low hardware utilization of large-scale distributed training, this paper proposes Partial Decoupled Asynchronous Stochastic Gradient Descent (PD-ASGD). PD-ASGD decouples the forward and backward passes into separate threads with a tunable forward-to-backward thread ratio, performs layer-wise (partial) asynchronous parameter updates to reduce gradient staleness, and comes with a mathematical characterization of the induced gradient bias together with a convergence proof. Compared to synchronous data parallelism, PD-ASGD runs up to 5.95× faster in the presence of delays, outperforms comparable ASGD variants by up to 2.14×, and achieves higher model FLOPs utilization while delivering accuracy close to the state of the art (SOTA).

📝 Abstract
The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads than the usual 1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise (partial) model updates concurrently across multiple threads. This reduces parameter staleness and consequently improves robustness to delays. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays, and up to $2.14\times$ faster than comparable ASGD algorithms by achieving higher model FLOPs utilization. We mathematically describe the gradient bias introduced by our method, establish an upper bound, and prove convergence.
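
A minimal sketch of the mechanism the abstract describes, assuming a PyTorch-style setup: several forward threads evaluate the shared model on mini-batches and enqueue their losses, while a backward thread consumes them and applies each layer's gradient as soon as it is available. All names here (forward_worker, backward_worker, pending, param_lock) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of decoupled forward/backward threads with layer-wise
# updates (hypothetical code, not the authors' implementation).
import threading
import queue

import torch.nn as nn
import torch.nn.functional as F

shared_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
pending = queue.Queue(maxsize=8)   # forward results waiting for backpropagation
param_lock = threading.Lock()      # serializes parameter writes during updates


def forward_worker(data_iter):
    """Run forward passes on (possibly stale) shared parameters and enqueue losses."""
    for x, y in data_iter:
        loss = F.cross_entropy(shared_model(x), y)
        pending.put(loss)


def backward_worker(lr=0.01, steps=1000):
    """Backpropagate queued losses and update parameters layer by layer."""
    for _ in range(steps):
        loss = pending.get()
        loss.backward()
        # Layer-wise (partial) update: apply each parameter's gradient as soon
        # as it is available instead of waiting for a full synchronized step.
        for p in shared_model.parameters():
            if p.grad is not None:
                with param_lock:
                    p.data.add_(p.grad, alpha=-lr)
                p.grad = None


# A forward-to-backward thread ratio above 1:1 (e.g. 3:1) keeps the backward
# thread busy; the exact ratio is the tunable knob described in the abstract.
# for _ in range(3):
#     threading.Thread(target=forward_worker, args=(train_loader,), daemon=True).start()
# threading.Thread(target=backward_worker, daemon=True).start()
```

The commented launch lines at the bottom show the tunable forward-to-backward thread ratio; train_loader is a placeholder for any iterable of (input, label) batches.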
Problem

Research questions and friction points this paper is trying to address.

High communication and synchronization overheads in distributed data-parallel training
Interlocking of forward and backward passes bottlenecks backpropagation within ASGD workers
Sensitivity of ASGD to delays from communication and throughput differences across devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples forward and backward passes into separate threads with a tunable thread ratio
Performs layer-wise (partial) model updates concurrently across threads, reducing parameter staleness
Achieves higher model FLOPs utilization (see the sketch after this list)
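
The FLOPs-utilization claim refers to model FLOPs utilization (MFU), i.e. achieved FLOP/s divided by the hardware's peak FLOP/s. A hedged sketch of that standard calculation, with made-up placeholder numbers rather than results from the paper:

```python
# Standard estimate of model FLOPs utilization (MFU); values are illustrative,
# not measurements from the paper.

def model_flops_utilization(flops_per_step, step_time_s, peak_flops_per_s):
    """Achieved FLOP/s divided by the device's peak FLOP/s."""
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical example: a training step needing 2.0e12 FLOPs that finishes in
# 0.05 s on a device with a 1.0e14 FLOP/s peak gives an MFU of 0.4 (40%).
print(model_flops_utilization(2.0e12, 0.05, 1.0e14))
```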
Cabrel Teguemne Fokam
Institut für Neuroinformatik, Ruhr Universität Bochum, Germany
Khaleelulla Khan Nazeer
Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics, Technische Universität Dresden, Germany
Lukas König
Institut für Neuroinformatik, Ruhr Universität Bochum, Germany
David Kappel
Bielefeld University
efficient machine learning, neuromorphic engineering, computational neuroscience
Anand Subramoney
Department of Computer Science, Royal Holloway, University of London, United Kingdom