Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sign-SGD suffers from impractical step-size tuning in large language model training: its theoretically optimal step size depends on data-dependent parameters that are unknown in practice. To address this, we propose the first parameter-agnostic Sign-SGD framework that unifies memory-efficient single-node training and gradient-compressed multi-node distributed optimization. Our contributions are threefold: (1) the first adaptive step-size design that applies in deterministic, stochastic, and distributed settings; (2) a momentum-enhanced variant that improves stability and convergence; and (3) a convergence analysis supported by empirical evidence. Experiments on realistic tasks demonstrate that our method significantly reduces communication overhead and memory footprint (by up to 99% in gradient transmission) while maintaining convergence speed and final accuracy comparable to standard SGD. This resolves a key practical bottleneck hindering Sign-SGD's adoption in large-scale LLM training.
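To make the basic idea concrete, here is a minimal sketch of the plain Sign-SGD update (not the paper's adaptive variant): each parameter moves by a fixed step size in the direction of the sign of its gradient coordinate, so only one bit of gradient information per parameter is needed.

```python
import numpy as np

def sign_sgd_step(x, grad, step_size):
    # Apply only the sign of each gradient coordinate: one bit of
    # information per parameter instead of a full-precision float.
    return x - step_size * np.sign(grad)

# Toy run: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = sign_sgd_step(x, x, step_size=0.05)
```

Note that with a fixed step size the iterates can only approach the optimum to within roughly one step's distance and then oscillate, which is exactly why the choice of step size matters so much for this method.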

📝 Abstract
Recently, large language models have made significant breakthroughs across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One method gaining popularity in light of these challenges is Sign-SGD. It can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, its effective stepsize cannot be determined automatically from theory alone, since it depends on parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD and extend them to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
Problem

Research questions and friction points this paper is trying to address.

The theoretically effective stepsize for Sign-SGD cannot be determined automatically
Dataset-dependent parameters required by the theory are inaccessible in real-world learning
Need for practical Sign-SGD variants in single-node and multi-node learning
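In the multi-node setting the paper targets, the standard Sign-SGD communication pattern is majority voting: each worker sends only the signs of its gradient, and the server broadcasts back the per-coordinate majority. A rough sketch (illustrative, not the paper's exact algorithm, with made-up worker gradients):

```python
import numpy as np

def majority_vote_step(x, worker_grads, step_size):
    # Each worker transmits only sign(grad); the server tallies a
    # per-coordinate majority vote and broadcasts back a single sign
    # vector, so communication is compressed in both directions.
    votes = np.sum([np.sign(g) for g in worker_grads], axis=0)
    return x - step_size * np.sign(votes)

# Three hypothetical workers; the majority direction is (+, -, +).
grads = [np.array([1.0, -2.0, 0.3]),
         np.array([0.5, -0.1, -0.2]),
         np.array([2.0, 1.0, 0.4])]
x = majority_vote_step(np.zeros(3), grads, step_size=0.1)
```

The compression comes from transmitting one bit per coordinate instead of a full-precision value, which is the source of the large reduction in gradient-transmission cost the summary cites.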
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-free optimization for Sign-SGD
Extends Sign-SGD to stochastic single-node learning
Extends Sign-SGD to multi-node learning with momentum-enhanced variants
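The momentum-enhanced variant follows the general pattern of Signum-style methods: take the sign of an exponential moving average of gradients rather than the raw gradient. A minimal sketch of one such step (the `beta` value and update form here are generic assumptions, not the paper's specific design):

```python
import numpy as np

def signum_step(x, m, grad, step_size, beta=0.9):
    # Sign of an exponential moving average of gradients instead of
    # the raw gradient; the averaging damps spurious sign flips
    # caused by stochastic mini-batch noise.
    m = beta * m + (1.0 - beta) * grad
    return x - step_size * np.sign(m), m

# One step from x = (1, -1) with gradient (2, -3).
x, m = np.array([1.0, -1.0]), np.zeros(2)
x, m = signum_step(x, m, np.array([2.0, -3.0]), step_size=0.1)
```

Because only the sign of the momentum buffer is used, the update remains 1-bit compressible while gaining the stabilizing effect of averaging.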