🤖 AI Summary
This work addresses the detrimental impact of large delays caused by slow workers in asynchronous stochastic gradient descent (ASGD), which degrade convergence when a fixed learning rate is used. The paper analyzes gradient clipping for ASGD and establishes convergence guarantees under a sub-Weibull model for gradient noise, both in expectation and, for the first time in asynchronous optimization, with high probability. Clipping removes the dependence of the oracle complexity on the maximum delay, which improves robustness against stragglers in both distributed and federated learning settings. Moreover, the analysis accommodates a broad class of heavy-tailed noise distributions, extending the applicability of ASGD beyond conventional sub-Gaussian assumptions.
📝 Abstract
In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is still negatively affected by slow workers because of large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior: we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
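To make the setting concrete, the following is a minimal single-process sketch of ASGD with gradient clipping on a toy quadratic. It is not the paper's algorithm or experimental setup: the objective, the uniform random delay model, the Gaussian noise, and every name and parameter (`clip`, `clipped_asgd`, `tau`, `max_delay`, `noise_scale`) are assumptions chosen for illustration.

```python
import random


def clip(g, tau):
    """Rescale vector g so that its Euclidean norm is at most tau."""
    norm = sum(x * x for x in g) ** 0.5
    if norm <= tau or norm == 0.0:
        return list(g)
    return [x * (tau / norm) for x in g]


def noisy_grad(x, rng, noise_scale):
    """Stochastic gradient of f(x) = 0.5 * ||x||^2 (assumed toy objective)."""
    return [xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]


def clipped_asgd(x0, steps, lr, tau, max_delay, noise_scale=0.1, seed=0):
    """Simulate asynchronous SGD: each update uses a clipped gradient
    that was computed at a stale iterate from up to max_delay steps ago."""
    rng = random.Random(seed)
    history = [list(x0)]  # history[t] is the iterate before update t
    x = list(x0)
    for t in range(steps):
        # Assumed delay model: uniform over the allowed staleness range.
        delay = rng.randint(0, min(max_delay, t))
        stale = history[t - delay]
        g = clip(noisy_grad(stale, rng, noise_scale), tau)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        history.append(list(x))
    return x


if __name__ == "__main__":
    final = clipped_asgd([5.0, -5.0], steps=2000, lr=0.05, tau=1.0, max_delay=10)
    print("final iterate:", final)
```

Because every applied gradient has norm at most `tau`, a single stale update can move the iterate by at most `lr * tau`, no matter how outdated the iterate it was computed from. This bounded per-step movement is the intuition behind the stabilizing effect described above; the sketch does not reproduce the paper's complexity bounds.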