🤖 AI Summary
Large-scale models often exceed the memory capacity of a single GPU, while inter-node gradient communication incurs high overhead and full momentum synchronization is inefficient. To address these challenges, this paper proposes FlexDeMo, a distributed training method that integrates Decoupled Momentum (DeMo) with hybrid sharding. Its core idea is to perform full momentum and gradient synchronization within each node, while transmitting only the fast-moving gradient components across nodes and accumulating the remaining momentum locally, thereby decoupling inter-node communication from optimizer state updates. FlexDeMo combines DeMo with hybrid sharding, supports AdamW, and unifies full and hybrid sharding under a single framework. Experiments demonstrate that FlexDeMo matches AdamW's convergence behavior in multi-node, multi-GPU setups while significantly reducing inter-node communication volume.
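The exchange scheme described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it treats the largest-magnitude momentum entries as the "fast-moving" components to transmit inter-node (DeMo itself uses a DCT-based extraction), and keeps the residual momentum local. The function name `demo_step` and the top-k selection are assumptions for illustration.

```python
import numpy as np

def demo_step(momentum, grad, beta=0.9, k=4):
    """One simplified decoupled-momentum step.

    Accumulates the gradient into local momentum, extracts the k
    largest-magnitude entries as the fast-moving components to be
    exchanged across nodes, and retains the residual locally.
    """
    momentum = beta * momentum + grad            # local momentum accumulation
    idx = np.argsort(np.abs(momentum))[-k:]      # pick fast-moving components
    transmitted = np.zeros_like(momentum)
    transmitted[idx] = momentum[idx]             # only these cross the node boundary
    residual = momentum - transmitted            # the rest stays local
    return residual, transmitted
```

In a real hybrid-sharding setup, the intra-node synchronization would be a full all-reduce over the local GPUs, and only `transmitted` would enter the (slower) inter-node collective.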
📝 Abstract
Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, when considering larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo need to be reconsidered. Here, we propose employing a hybrid strategy, FlexDeMo, whereby nodes fully synchronize locally between different GPUs, while inter-node communication is reduced by exchanging only the fast-moving components. This effectively combines previous hybrid sharding strategies with the advantages of decoupled momentum. Our experimental results show that FlexDeMo is on par with AdamW in terms of validation loss, demonstrating its viability.