🤖 AI Summary
Sign-SGD suffers from impractical step-size tuning in large language model training, as its theoretical step size depends on unknown data-dependent parameters. To address this, we propose the first parameter-agnostic Sign-SGD framework that unifies memory-efficient single-node training and gradient-compressed multi-node distributed optimization. Our contributions are threefold: (1) the first adaptive step-size design that applies in deterministic, stochastic, and distributed settings; (2) a momentum-enhanced variant that improves stability and convergence; and (3) a convergence analysis that matches observed empirical behavior. Experiments on realistic tasks demonstrate that our method significantly reduces communication overhead and memory footprint (by up to 99% in gradient transmission) while maintaining convergence speed and final accuracy comparable to standard SGD. This resolves the key practical bottleneck hindering Sign-SGD's adoption in large-scale LLM training.
📝 Abstract
Recently, large language models have achieved significant breakthroughs across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One method gaining popularity in light of these challenges is Sign-SGD. It can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, its effective step size cannot be determined automatically from theory: the theoretically optimal value depends on parameters of the dataset to which we do not have access in real-world training. To address this issue, we design several variants of single-node deterministic Sign-SGD and extend our approaches to practical scenarios: stochastic single-node and multi-node learning, as well as methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
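To make the object of study concrete, below is a minimal sketch of the *vanilla* Sign-SGD update that the abstract refers to, not the paper's adaptive, parameter-agnostic variant. The toy quadratic objective and the fixed `step_size` value are illustrative assumptions; they also show why tuning matters, since the iterates can only approach the optimum to within the (fixed) step size.

```python
import numpy as np

def sign_sgd_step(x, grad, step_size):
    """One plain Sign-SGD update: move along the coordinate-wise sign
    of the gradient. Because only the sign of each coordinate is used,
    the same rule doubles as a 1-bit gradient compressor in
    distributed training."""
    return x - step_size * np.sign(grad)

# Toy example (assumed for illustration): minimize f(x) = ||x||^2,
# whose gradient is 2x, with a hand-picked fixed step size. With a
# fixed step the iterates cannot settle closer to the minimizer than
# roughly one step size -- the tuning problem the paper targets.
x = np.array([3.0, -2.0, 1.0])
for _ in range(200):
    x = sign_sgd_step(x, 2.0 * x, step_size=0.02)

# After 200 iterations every coordinate oscillates near 0
# within about one step size.
print(np.abs(x).max())
```

The same loop with a decaying step-size schedule (e.g. proportional to 1/sqrt(t)) would let the iterates converge rather than oscillate, which is the kind of schedule Sign-SGD theory prescribes but whose constants depend on unknown data parameters.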