🤖 AI Summary
This work addresses arbitrary asynchrony and computational heterogeneity in distributed learning by proposing OrLoMo, the first asynchronous momentum SGD algorithm that supports local updates. The key innovation is an ordered momentum aggregation mechanism: each worker performs local momentum SGD independently, while the server aggregates the local momenta in the order of their global iteration indices. Theoretical analysis establishes that OrLoMo converges for non-convex objectives under arbitrary delays. Experimental results show that the method outperforms existing synchronous and asynchronous baselines in both convergence speed and generalization performance.
📝 Abstract
Momentum SGD (MSGD) is a foundational optimizer for training deep models, since momentum both accelerates convergence and improves generalization. Meanwhile, asynchronous distributed learning is crucial for training large-scale deep models, especially when the workers in a cluster have heterogeneous computing capabilities. To reduce communication frequency, local updates are widely adopted in distributed learning. However, how to implement asynchronous distributed MSGD with local updates has remained unexplored. To solve this problem, we propose a novel method, called **or**dered **lo**cal **mo**mentum (OrLoMo), for asynchronous distributed learning. In OrLoMo, each worker runs MSGD locally; the server then aggregates the local momenta from the workers in order of their global iteration indices. To the best of our knowledge, OrLoMo is the first method to implement asynchronous distributed MSGD with local updates. We prove the convergence of OrLoMo for non-convex problems under arbitrary delays. Experiments validate that OrLoMo can outperform both its synchronous counterpart and other asynchronous methods.
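The ordered-aggregation idea can be illustrated with a minimal sketch. The class below is hypothetical (the paper does not publish this interface): workers send `(global_iteration_index, local_momentum)` pairs asynchronously, and the server buffers out-of-order arrivals in a min-heap, applying momenta to the model strictly in global-iteration order. It is a toy with plain float lists, not the authors' implementation.

```python
import heapq

class OrderedMomentumServer:
    """Toy sketch of OrLoMo-style ordered momentum aggregation.

    Assumption (not from the paper): each worker has already folded its
    gradients into a local momentum vector and ships it with the global
    iteration index it corresponds to. The server applies these momenta
    in index order, buffering any that arrive early.
    """

    def __init__(self, params, lr=0.1):
        self.params = list(params)   # model parameters (plain floats here)
        self.lr = lr
        self.next_index = 0          # next global iteration to apply
        self.buffer = []             # min-heap of (index, momentum) pairs

    def receive(self, index, momentum):
        """Accept a worker's local momentum tagged with its global index."""
        heapq.heappush(self.buffer, (index, momentum))
        # Drain every contiguous momentum starting at next_index, so
        # updates are applied in order regardless of arrival order.
        while self.buffer and self.buffer[0][0] == self.next_index:
            _, m = heapq.heappop(self.buffer)
            self.params = [p - self.lr * mi
                           for p, mi in zip(self.params, m)]
            self.next_index += 1
```

For example, if the momentum for iteration 1 arrives before iteration 0 (a delayed worker), it sits in the buffer; once index 0 arrives, both are applied back to back in the correct order.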