Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high communication and synchronization overheads and low hardware utilization of large-scale distributed training, this paper proposes Partial Decoupled Asynchronous Stochastic Gradient Descent (PD-ASGD). PD-ASGD decouples the forward and backward passes into separate threads with a tunable forward-to-backward thread ratio, performs layer-wise (partial) asynchronous parameter updates to reduce gradient staleness, and comes with a mathematical characterization of the induced gradient bias together with a convergence proof. Compared to synchronous data parallelism, PD-ASGD runs up to 5.95× faster in the presence of delays, outperforms comparable ASGD variants by up to 2.14×, and achieves higher model FLOPs utilization while delivering accuracy close to the state of the art (SOTA).

📝 Abstract
The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads than the usual 1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise (partial) model updates concurrently across multiple threads. This reduces parameter staleness and consequently improves robustness to delays. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays, and up to $2.14\times$ faster than comparable ASGD algorithms by achieving higher model FLOPs utilization. We mathematically describe the gradient bias introduced by our method, establish an upper bound, and prove convergence.
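
A minimal sketch of the mechanism the abstract describes, assuming a PyTorch-style setup: several forward threads evaluate the shared model on mini-batches and enqueue their losses, while a backward thread consumes them and applies each layer's gradient as soon as it is available. All names here (forward_worker, backward_worker, pending, param_lock) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of decoupled forward/backward threads with layer-wise
# updates (hypothetical code, not the authors' implementation).
import threading
import queue

import torch.nn as nn
import torch.nn.functional as F

shared_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
pending = queue.Queue(maxsize=8)   # forward results waiting for backpropagation
param_lock = threading.Lock()      # serializes parameter writes during updates


def forward_worker(data_iter):
    """Run forward passes on (possibly stale) shared parameters and enqueue losses."""
    for x, y in data_iter:
        loss = F.cross_entropy(shared_model(x), y)
        pending.put(loss)


def backward_worker(lr=0.01, steps=1000):
    """Backpropagate queued losses and update parameters layer by layer."""
    for _ in range(steps):
        loss = pending.get()
        loss.backward()
        # Layer-wise (partial) update: apply each parameter's gradient as soon
        # as it is available instead of waiting for a full synchronized step.
        for p in shared_model.parameters():
            if p.grad is not None:
                with param_lock:
                    p.data.add_(p.grad, alpha=-lr)
                p.grad = None


# A forward-to-backward thread ratio above 1:1 (e.g. 3:1) keeps the backward
# thread busy; the exact ratio is the tunable knob described in the abstract.
# for _ in range(3):
#     threading.Thread(target=forward_worker, args=(train_loader,), daemon=True).start()
# threading.Thread(target=backward_worker, daemon=True).start()
```

The commented launch lines at the bottom show the tunable forward-to-backward thread ratio; train_loader is a placeholder for any iterable of (input, label) batches.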
Problem

Research questions and friction points this paper is trying to address.

High communication and synchronization overheads in distributed data-parallel training
Interlocking of forward and backward passes bottlenecks backpropagation within ASGD workers
Sensitivity of ASGD to delays from communication and throughput differences across devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples forward and backward passes into separate threads with a tunable thread ratio
Performs layer-wise (partial) model updates concurrently across threads, reducing parameter staleness
Achieves higher model FLOPs utilization (see the sketch after this list)
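
The FLOPs-utilization claim refers to model FLOPs utilization (MFU), i.e. achieved FLOP/s divided by the hardware's peak FLOP/s. A hedged sketch of that standard calculation, with made-up placeholder numbers rather than results from the paper:

```python
# Standard estimate of model FLOPs utilization (MFU); values are illustrative,
# not measurements from the paper.

def model_flops_utilization(flops_per_step, step_time_s, peak_flops_per_s):
    """Achieved FLOP/s divided by the device's peak FLOP/s."""
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical example: a training step needing 2.0e12 FLOPs that finishes in
# 0.05 s on a device with a 1.0e14 FLOP/s peak gives an MFU of 0.4 (40%).
print(model_flops_utilization(2.0e12, 0.05, 1.0e14))
```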
Cabrel Teguemne Fokam
Institut für Neuroinformatik, Ruhr Universität Bochum, Germany
Khaleelulla Khan Nazeer
Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics, Technische Universität Dresden, Germany
Lukas König
Institut für Neuroinformatik, Ruhr Universität Bochum, Germany
David Kappel
Bielefeld University
efficient machine learning, neuromorphic engineering, computational neuroscience
Anand Subramoney
Department of Computer Science, Royal Holloway, University of London, United Kingdom