SGD with memory: fundamental properties and stochastic acceleration

📅 2024-10-05

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Mini-batch SGD converges slowly on quadratic problems with power-law spectra, and the acceleration that Heavy Ball achieves in the noiseless setting breaks down under mini-batch noise. Method: We study memory-M algorithms, first-order methods with a fixed number M of auxiliary velocity vectors. We prove an equivalence between two forms of such algorithms, describe them through characteristic polynomials, and expand the loss in terms of signal and noise propagators. Contribution/Results: Stationary stable memory-M algorithms always retain the convergence exponent of plain GD, but their convergence constant can differ, depending on an effective learning rate that generalizes that of Heavy Ball. For memory-1 algorithms, we prove that this constant can be made arbitrarily small while maintaining stability. Building on this, we propose a memory-1 algorithm with a time-dependent schedule that improves the exponent of plain SGD; this improvement is supported by heuristic arguments and experiments rather than by a rigorous proof.

📝 Abstract

An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t\sim C_L t^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise. We address this challenge by considering first-order methods with an arbitrary fixed number $M$ of auxiliary velocity vectors (*memory-$M$ algorithms*). We first prove an equivalence between two forms of such algorithms and describe them in terms of suitable characteristic polynomials. Then we develop a general expansion of the loss in terms of signal and noise propagators. Using it, we show that losses of stationary stable memory-$M$ algorithms always retain the exponent $\xi$ of plain GD, but can have different constants $C_L$ depending on their effective learning rate that generalizes that of HB. We prove that in memory-1 algorithms we can make $C_L$ arbitrarily small while maintaining stability. As a consequence, we propose a memory-1 algorithm with a time-dependent schedule that we show heuristically and experimentally to improve the exponent $\xi$ of plain SGD.
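To make the setting concrete, here is a minimal sketch of mini-batch Heavy Ball, which uses a single velocity vector, on a diagonal quadratic with a power-law spectrum. It is an illustration under stated assumptions, not the paper's algorithm or schedule: the Gaussian feature model, the spectrum exponent `nu`, the initial error `delta0`, and the function names `power_law_problem` and `run_heavy_ball_sgd` are all hypothetical choices made for this example. The hyperparameters are constant in time, so this corresponds to the stationary regime and not to the time-dependent schedule proposed in the abstract.

```python
import numpy as np


def power_law_problem(dim=200, nu=1.5):
    """Hypothetical test problem: eigenvalues lam_k = k^(-nu), k = 1..dim.

    Returns the spectrum and an (assumed) initial error vector delta0 = w0 - w*.
    """
    k = np.arange(1, dim + 1, dtype=float)
    lam = k ** (-nu)
    delta0 = k ** (-0.5)
    return lam, delta0


def run_heavy_ball_sgd(lam, delta0, alpha, beta, batch, steps, rng):
    """Mini-batch Heavy Ball on L(w) = 0.5 * sum_k lam_k * delta_k^2.

    Features are assumed Gaussian with covariance diag(lam); targets are
    noiseless, so all stochasticity comes from mini-batch sampling.
    Setting beta = 0 recovers plain mini-batch SGD.
    """
    delta = delta0.copy()
    velocity = np.zeros_like(delta)
    losses = np.empty(steps)
    for t in range(steps):
        # Exact population loss at the current iterate.
        losses[t] = 0.5 * np.sum(lam * delta ** 2)
        # Sample a mini-batch of features x ~ N(0, diag(lam)).
        x = rng.standard_normal((batch, lam.size)) * np.sqrt(lam)
        # Stochastic gradient of the squared loss on this batch.
        grad = x.T @ (x @ delta) / batch
        # Heavy Ball update with constant learning rate and momentum.
        velocity = beta * velocity - alpha * grad
        delta = delta + velocity
    return losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam, delta0 = power_law_problem()
    hb = run_heavy_ball_sgd(lam, delta0, alpha=0.2, beta=0.5,
                            batch=8, steps=2000, rng=rng)
    sgd = run_heavy_ball_sgd(lam, delta0, alpha=0.2, beta=0.0,
                             batch=8, steps=2000, rng=rng)
    print("final loss, Heavy Ball:", hb[-1])
    print("final loss, plain SGD: ", sgd[-1])
```

For constant hyperparameters, the effective learning rate of Heavy Ball is commonly taken to be `alpha / (1 - beta)`. Plotting `hb` and `sgd` on log-log axes is one way to inspect the exponent $\xi$ and the constant $C_L$ empirically.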

Problem

Research questions and friction points this paper is trying to address.

Accelerating mini-batch SGD on quadratic problems with power-law spectrum.

Exploring memory-M algorithms to improve loss convergence exponents.

Proposing a memory-1 algorithm to enhance SGD performance.

Innovation

Methods, ideas, or system contributions that make the work stand out.

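Stationary stable memory-M algorithms keep the convergence exponent of plain GD but can lower the convergence constant.

Characteristic polynomials describe memory-M algorithms, whose two forms are proved equivalent.

A memory-1 algorithm with a time-dependent schedule improves the exponent of plain SGD, as supported by heuristic arguments and experiments.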
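As a small illustration of how a characteristic polynomial captures the behavior of a memory-1 method, consider classical Heavy Ball without noise on a single eigenmode with eigenvalue `lam`. This is a textbook special case used here as an assumption, not the general construction from the paper. The error obeys the recurrence d_{t+1} = (1 + beta - alpha*lam) d_t - beta d_{t-1}, whose characteristic polynomial is z^2 - (1 + beta - alpha*lam) z + beta. The function names below are hypothetical.

```python
import numpy as np


def heavy_ball_char_roots(alpha, beta, lam):
    """Roots of z^2 - (1 + beta - alpha*lam) z + beta.

    This is the characteristic polynomial of noiseless Heavy Ball
    restricted to one eigenmode with eigenvalue lam.
    """
    return np.roots([1.0, -(1.0 + beta - alpha * lam), beta])


def is_stable(alpha, beta, lam):
    """The mode decays if and only if both roots lie strictly inside the unit circle."""
    roots = heavy_ball_char_roots(alpha, beta, lam)
    return bool(np.max(np.abs(roots)) < 1.0)


if __name__ == "__main__":
    for alpha in (0.2, 2.0, 4.0):
        roots = heavy_ball_char_roots(alpha, beta=0.5, lam=1.0)
        print(alpha, np.abs(roots), is_stable(alpha, 0.5, 1.0))
```

In this special case the roots stay inside the unit circle when |beta| < 1 and 0 < alpha*lam < 2(1 + beta), which is the usual stability region of Heavy Ball.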

🔎 Similar Papers

Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent