🤖 AI Summary
This work addresses decentralized stochastic smooth convex optimization over a fixed gossip network. The goal is to determine the largest number of workers $M$ that can be used under a total budget of $N$ gradient samples while preserving the centralized statistical rate of $O(1/\sqrt{N})$. To this end, the authors propose an accelerated decentralized method built on a one-step-delayed stochastic acceleration scheme, which lets workers interleave minibatching with accelerated gossip while controlling the residual disagreement among them. Its guarantee depends only logarithmically on the optimum-local heterogeneity. The method preserves the centralized rate for up to $M \lesssim \sqrt{\rho}\, N^{3/4}$ workers, where $\rho$ denotes the spectral gap of the gossip network, improving on the best prior maximal scaling of $M \lesssim \rho \sqrt{N}$. The authors also establish a matching lower bound for linear-span decentralized first-order methods, so the method is optimal up to logarithmic factors.
📝 Abstract
We study decentralized stochastic smooth convex optimization, where $M$ workers minimize an average objective using local stochastic gradients and neighbor-only communication over a fixed gossip network. A central question in this setting is to determine the largest number of workers that can be used under a total budget of $N$ gradient samples while still preserving the centralized $O(1/\sqrt{N})$ statistical rate. We introduce an accelerated decentralized method that preserves this rate for up to $M \lesssim \sqrt{\rho}\, N^{3/4}$ workers, where $\rho$ is the spectral gap of the gossip network, improving the best prior maximal scaling of $M \lesssim \rho \sqrt{N}$. The method is based on a one-step-delayed stochastic acceleration scheme that enables workers to interleave minibatching with accelerated gossip while controlling residual disagreement, and its guarantee depends only logarithmically on the optimum-local heterogeneity. We also establish a matching lower bound for linear-span decentralized first-order methods, showing that the method is optimal up to logarithmic factors.
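To give a feel for the gap between the two scaling bounds above, the following minimal sketch evaluates both for one illustrative setting. It assumes all hidden constants and logarithmic factors equal 1, which the abstract does not state, and the function names and the values of `N` and `rho` are hypothetical choices made only for this example. With $N = 10^8$ and $\rho = 0.01$, the prior bound allows on the order of $10^2$ workers, while the new bound allows on the order of $10^5$.

```python
import math


def max_workers_prior(n_samples: int, spectral_gap: float) -> float:
    """Prior scaling M <~ rho * sqrt(N), with the hidden constant assumed to be 1."""
    return spectral_gap * math.sqrt(n_samples)


def max_workers_new(n_samples: int, spectral_gap: float) -> float:
    """New scaling M <~ sqrt(rho) * N^(3/4), constants and log factors assumed to be 1."""
    return math.sqrt(spectral_gap) * n_samples ** 0.75


# Illustrative values only (not taken from the paper).
N = 10**8    # total gradient sample budget
rho = 0.01   # spectral gap of the gossip network

prior = max_workers_prior(N, rho)
new = max_workers_new(N, rho)

print(f"prior bound: about {prior:.0f} workers")
print(f"new bound:   about {new:.0f} workers")
print(f"improvement factor: about {new / prior:.0f}x")
```

Under these assumptions the improvement factor is $N^{1/4}/\sqrt{\rho}$, so it grows both with the sample budget and as the network becomes more poorly connected (smaller $\rho$).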