🤖 AI Summary
This work addresses the high communication cost of distributed learning, where existing methods typically have a communication complexity that depends on the total number of samples \(N\). The authors propose Local MixVR, a distributed optimization framework that integrates local updates with variance-reduction techniques. They show it is the first distributed method whose communication complexity depends only on the number of worker nodes \(M\) and is independent of \(N\). When \(M = O(N^{1/4})\), a regime the authors describe as common, the proposed method outperforms the state-of-the-art Minibatch Accelerated SGD baseline. Under typical settings, the framework substantially reduces communication overhead and demonstrates superior empirical performance compared to strong baselines such as Minibatch Accelerated SGD.
📝 Abstract
Communication overhead is a crucial bottleneck in scalable distributed learning. While existing methods such as Local SGD, Minibatch SGD, and their accelerated variants aim to use data points efficiently, they still exhibit communication-round complexity that scales with the total number of samples $N$. In this paper, we introduce Local MixVR, a distributed framework that integrates local updates with variance-reduction techniques to mitigate local noise. We show that Local MixVR is the first distributed method to eliminate the dependence of communication complexity on $N$, achieving a complexity that scales only with the number of workers $M$. In common regimes where $M<O\left(N^{1/4}\right)$, Local MixVR outperforms the state-of-the-art Minibatch Accelerated SGD baseline, bridging a long-standing gap in distributed optimization and establishing a new paradigm for communication-efficient training.
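To make the idea of combining local updates with variance reduction concrete, here is a minimal sketch in pure Python. It is not the Local MixVR algorithm, whose details the abstract does not give. It assumes a generic SVRG-style control variate, a scalar least-squares objective, equal-sized data shards, and plain model averaging at each communication round. All names (`grad`, `full_grad`, `local_vr_round`, `run`) and hyperparameters are hypothetical and chosen only for illustration.

```python
import random


def grad(w, a, b):
    """Gradient of the per-sample loss 0.5 * (a * w - b) ** 2 with respect to w."""
    return (a * w - b) * a


def full_grad(w, shard):
    """Average gradient over every sample held by one worker."""
    return sum(grad(w, a, b) for a, b in shard) / len(shard)


def local_vr_round(w_global, shards, local_steps, lr, rng):
    """One communication round (assumed SVRG-style scheme, not the paper's).

    Workers share an anchor gradient computed at the global model, take
    several variance-reduced local steps, and the server averages the results.
    """
    # Anchor: global full gradient at w_global (shards are equal-sized, so the
    # mean of per-shard means equals the mean over all samples).
    anchor = sum(full_grad(w_global, s) for s in shards) / len(shards)

    local_models = []
    for shard in shards:
        w = w_global
        for _ in range(local_steps):
            a, b = rng.choice(shard)
            # Control variate: subtract the same sample's gradient at the
            # anchor point and add back the anchor gradient.
            g = grad(w, a, b) - grad(w_global, a, b) + anchor
            w -= lr * g
        local_models.append(w)

    # Server step: average the local models.
    return sum(local_models) / len(local_models)


def run(num_workers=4, samples_per_worker=50, rounds=20,
        local_steps=10, lr=0.1, seed=0):
    """Build synthetic data b = a * 3.0 + noise and run the rounds."""
    rng = random.Random(seed)
    w_true = 3.0
    shards = []
    for _ in range(num_workers):
        shard = []
        for _ in range(samples_per_worker):
            a = rng.uniform(0.5, 1.5)
            b = a * w_true + rng.gauss(0.0, 0.1)
            shard.append((a, b))
        shards.append(shard)

    w = 0.0
    for _ in range(rounds):
        w = local_vr_round(w, shards, local_steps, lr, rng)
    return w, shards
```

Calling `run()` returns the final averaged model together with the synthetic shards, so the result can be compared against the closed-form least-squares solution. In this toy setup the number of communication rounds is fixed by `rounds` and does not grow with the number of samples per worker, which is the kind of behavior the abstract describes. The sketch makes no claim about the rates proved in the paper.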