🤖 AI Summary
Adam often converges to sharp minima during large language model training, impairing generalization and training stability. To address this, we propose AdamX—a novel adaptive optimization algorithm that introduces a dynamically adjusted exponential decay mechanism for the second-moment estimate. This mechanism progressively reduces the magnitude of step-size correction during training and naturally transitions to standard SGD in later stages. Consequently, AdamX preserves Adam’s rapid early-stage convergence while inheriting SGD’s superior generalization properties in the final phase. Extensive experiments across diverse NLP benchmarks demonstrate that AdamX consistently outperforms Adam and its prominent variants—including AdamW and AdaFactor—yielding both enhanced training stability and improved final model performance. The implementation is publicly available.
📝 Abstract
Since the beginning of the 21st century, artificial intelligence has been driving a new industrial revolution. Within the training framework, the role of the optimization algorithm is to steer a high-dimensional optimization process stably toward local, or even global, minima. In the era of large language models, despite the dramatic growth in model and data scale, Adam remains the mainstream optimization algorithm. However, compared with optimizers based on stochastic gradient descent (SGD), Adam is more likely to converge to non-flat minima. To address this issue, we propose the AdamX algorithm. Its core innovation is a novel exponential decay rate for the second-moment estimate, which gradually weakens the step-size correction as training progresses and degrades to SGD in the stable training phase, thereby improving training stability in that phase and potentially enhancing generalization. Experimental results show that the proposed exponential decay rate outperforms the standard one, and that AdamX consistently outperforms Adam and its variants. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.
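The mechanism described above (a step-size correction that weakens over training and eventually reduces to SGD) can be sketched in plain Python. The annealed exponent `p_t` below is a hypothetical schedule for illustration only, not the paper's exact decay-rate formula: at `p_t = 0.5` the update is standard Adam, and as `p_t -> 0` the second-moment correction vanishes and the step reduces to SGD with momentum.

```python
def adamx_like_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, total_steps=10_000):
    """One update of an AdamX-style optimizer (illustrative sketch).

    The exponent schedule `p_t` (0.5 -> 0) is a HYPOTHETICAL stand-in for
    the paper's decay mechanism: p_t = 0.5 gives the Adam update, while
    p_t = 0 removes the second-moment correction, leaving SGD with momentum.
    """
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                       # bias corrections
    v_hat = v / (1 - beta2 ** t)
    p_t = 0.5 * max(0.0, 1.0 - t / total_steps)        # anneal 0.5 -> 0
    update = lr * m_hat / (v_hat ** p_t + eps)         # p_t = 0: ~SGD step
    state.update(t=t, m=m, v=v)
    return param - update, state

# Usage: minimize f(x) = x^2 from x = 1.0.
state = {"t": 0, "m": 0.0, "v": 0.0}
x = 1.0
for _ in range(100):
    x, state = adamx_like_step(x, 2.0 * x, state, total_steps=100)
```

After `total_steps`, `p_t` is exactly zero, so late-stage updates are `lr * m_hat`, i.e., momentum SGD, matching the intended "degrade to SGD in the stable phase" behavior.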