🤖 AI Summary
This work investigates the joint scheduling of batch size and learning rate in mini-batch stochastic gradient descent (SGD), with the goal of accelerating empirical risk minimization as measured by the full gradient norm of the empirical loss. Grounded in stochastic optimization theory, it provides the first rigorous proof that *jointly increasing* both the batch size and the learning rate (including a warm-up decay variant) achieves faster convergence, in contrast to the conventional recipe of a fixed or increasing batch size paired with a decaying learning rate. Four batch-size and learning-rate scheduling strategies are proposed, expectation-based convergence bounds on the full gradient norm are derived, and two of these strategies are shown theoretically to significantly accelerate its decay. Numerical experiments demonstrate that the proposed strategies reduce the required training iterations by 30%–50% compared with classical baselines, substantially improving convergence efficiency.
📝 Abstract
The performance of mini-batch stochastic gradient descent (SGD) depends strongly on how the batch size and learning rate are set when minimizing the empirical loss in training a deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results supporting these analyses, showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
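To make the four schedulers concrete, the following is a minimal sketch of how each one could map a training epoch to a (batch size, learning rate) pair. All constants here (initial batch size, initial learning rate, growth and decay factors, warm-up length, stage interval) are illustrative assumptions, not the paper's actual hyperparameters, and the geometric growth/decay form is one plausible instantiation of "increasing" and "decaying".

```python
def scheduler(kind, epoch, b0=16, lr0=0.1, growth=2.0, decay=0.5,
              warmup_epochs=3, interval=10):
    """Return an illustrative (batch_size, learning_rate) for a given epoch.

    kind: 'i'   constant batch size, decaying learning rate
          'ii'  increasing batch size, decaying learning rate
          'iii' increasing batch size, increasing learning rate
          'iv'  increasing batch size, warm-up then decaying learning rate

    The schedule advances by one "stage" every `interval` epochs; within a
    stage, batch size and learning rate are held constant.
    """
    stage = epoch // interval
    if kind == "i":
        return b0, lr0 * decay**stage
    if kind == "ii":
        return int(b0 * growth**stage), lr0 * decay**stage
    if kind == "iii":
        # Learning rate grows together with the batch size; in practice
        # such growth would be capped, which this sketch omits.
        return int(b0 * growth**stage), lr0 * growth**stage
    if kind == "iv":
        if epoch < warmup_epochs:
            lr = lr0 * (epoch + 1) / warmup_epochs  # linear warm-up
        else:
            lr = lr0 * decay**stage  # decay after warm-up
        return int(b0 * growth**stage), lr
    raise ValueError(f"unknown scheduler kind: {kind!r}")
```

Under scheduler (i) only the learning rate changes across stages, whereas (ii)–(iv) double the batch size each stage; (iii) and (iv) are the two variants the paper identifies as accelerating the decay of the full gradient norm.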