Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets decentralized stochastic strongly convex optimization, where no existing stochastic method accelerates the dependence on both the condition number κ = L/μ and the network spectral gap 1−β at once. The authors propose Multi-Gossip Accelerated DSGD (MG-ADSGD), which combines Nesterov-type primal-dual extrapolation, multi-round fast gossip averaging, and mini-batch gradient estimation. The gossip depth is coupled with the mini-batch size so that extra communication rounds both improve consensus accuracy and reduce gradient variance. The method attains the best currently known communication complexity for this setting, Õ(σ²/(μnε)·log(1/ε) + √(κ/(1−β))·log(1/ε)), where n is the number of nodes, σ² is the gradient variance, and Õ hides logarithmic factors that are independent of the target accuracy ε.

📝 Abstract

Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number $κ = L/μ$ and the network spectral gap $1-β$. Although deterministic decentralized methods can simultaneously achieve accelerated $\sqrt{κ}$ and $1/\sqrt{1-β}$ dependences, no existing stochastic method attains both improvements at once. In this paper, we propose \emph{Multi-Gossip Accelerated DSGD} (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal--dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity \[ \widetilde{\mathcal{O}}\!\left( \frac{σ^2}{μ n ε}\log\frac{1}{ε} + \sqrt{\frac{κ}{1-β}}\,\log\frac{1}{ε} \right), \] where $ε$ denotes the target accuracy, $n$ is the number of nodes, and $σ^2$ is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of $ε$.
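To make the "multi-round fast gossip averaging" ingredient concrete, the following is a minimal Python sketch comparing plain gossip with a standard Chebyshev-type two-term accelerated gossip recursion. It is an illustration under stated assumptions, not the paper's routine: the abstract does not specify the gossip scheme, the ring topology with lazy weights (1/2 self, 1/4 per neighbor) is chosen only because its $β$ is known in closed form, and all function names are hypothetical. The sketch covers only the consensus part; the coupling with mini-batch size is not modeled.

```python
import math

def ring_gossip_step(x):
    """One gossip round on a ring (assumed topology): keep 1/2, take 1/4 from each neighbor."""
    n = len(x)
    return [0.5 * x[i] + 0.25 * x[(i - 1) % n] + 0.25 * x[(i + 1) % n]
            for i in range(n)]

def ring_beta(n):
    """Second-largest eigenvalue of the lazy ring mixing matrix: 1/2 + 1/2 * cos(2*pi/n)."""
    return 0.5 + 0.5 * math.cos(2.0 * math.pi / n)

def plain_gossip(x, rounds):
    """Repeated averaging; the consensus error contracts by about beta per round."""
    for _ in range(rounds):
        x = ring_gossip_step(x)
    return x

def accelerated_gossip(x, rounds, beta):
    """Two-term (momentum) recursion: x_next = (1 + eta) * W x_cur - eta * x_prev.

    With eta = (1 - s) / (1 + s) and s = sqrt(1 - beta^2), the error contracts
    by about sqrt(eta) per round, which behaves like 1 - O(sqrt(1 - beta)).
    """
    s = math.sqrt(1.0 - beta * beta)
    eta = (1.0 - s) / (1.0 + s)
    prev, cur = list(x), list(x)  # initialize x_{-1} = x_0 so the average is preserved
    for _ in range(rounds):
        mixed = ring_gossip_step(cur)
        nxt = [(1.0 + eta) * m - eta * p for m, p in zip(mixed, prev)]
        prev, cur = cur, nxt
    return cur

def consensus_error(x):
    """Euclidean distance between the node values and their network-wide average."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))

if __name__ == "__main__":
    n, rounds = 20, 30
    x0 = [float(i) for i in range(n)]  # each node starts with a different scalar
    beta = ring_beta(n)
    print("plain gossip error:      ", consensus_error(plain_gossip(x0, rounds)))
    print("accelerated gossip error:", consensus_error(accelerated_gossip(x0, rounds, beta)))
```

For the same number of communication rounds, the accelerated recursion leaves a much smaller consensus error on this poorly connected ring, which is the mechanism behind a $1/\sqrt{1-β}$ rather than $1/(1-β)$ dependence on the spectral gap.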

Problem

Research questions and friction points this paper is trying to address.

decentralized stochastic optimization

strongly convex optimization

communication complexity

condition number

network spectral gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

decentralized optimization

stochastic gradient descent

acceleration