🤖 AI Summary
This paper identifies and systematically analyzes the “Epochal Sawtooth Effect” (ESE)—a pervasive periodic oscillation in adaptive optimizers (especially Adam) wherein training loss sharply drops at the start of each epoch and gradually rises thereafter.
Method: Through empirical analysis, controlled ablation studies, quadratic-function simulations, and theoretical derivation, we formally name, model, and explain the origin of ESE.
Contribution/Results: We establish that ESE arises from Adam’s second-moment estimation (governed by β₂), which dynamically modulates the effective learning rate across iterations; β₂ critically determines both the amplitude and period of the sawtooth pattern. Experiments confirm ESE is most pronounced in Adam but also present in RMSProp and related adaptive methods. The effect is empirically reproducible, theoretically grounded, and generalizable across architectures and datasets. Our work provides a novel mechanistic lens for understanding the training dynamics of adaptive optimization, revealing previously overlooked epoch-level temporal structure induced by variance adaptation.
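The β₂-controlled modulation of the effective learning rate can be made concrete with a toy calculation. The sketch below isolates Adam's per-coordinate step magnitude, lr / (√v̂ₜ + ε), for a stream of scalar gradients; it is a standard restatement of Adam's update with illustrative values, not code from the paper.

```python
import math

def adam_effective_step(grads, lr=0.001, beta2=0.999, eps=1e-8):
    """Per-iteration effective step magnitude lr / (sqrt(v_hat) + eps)
    for a stream of scalar gradients, isolating the beta2-controlled term."""
    v, steps = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        v_hat = v / (1 - beta2 ** t)          # bias correction
        steps.append(lr / (math.sqrt(v_hat) + eps))
    return steps
```

Feeding in a gradient stream whose magnitude jumps (e.g. ten steps of |g| = 1 followed by ten of |g| = 10) shows the mechanism: with a smaller β₂ the second-moment estimate tracks the jump quickly and the effective step shrinks fast, while a larger β₂ averages over more history and adapts more slowly, so epoch-aligned shifts in gradient statistics translate into slowly decaying step-size (and hence loss) drift.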
📝 Abstract
In this paper, we identify and analyze a recurring training loss pattern, which we term the *Epochal Sawtooth Effect (ESE)*, commonly observed during training with adaptive gradient-based optimizers, particularly the Adam optimizer. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, although less severely, with other optimizers such as RMSProp. We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect, and we study the influence of factors such as $\beta_2$, batch size, and data shuffling on this pattern. We quantify the influence of $\beta_2$ on the shape of the loss curve, showing that higher values of $\beta_2$ result in a nearly linear increase in loss, while lower values create a concave upward trend. Our analysis reveals that this behavior stems from the adaptive learning rate controlled by the second-moment estimate, with $\beta_1$ playing a minimal role when $\beta_2$ is large. To support our analysis, we replicate this phenomenon through a controlled quadratic minimization task. By incrementally solving a series of quadratic optimization problems using Adam, we demonstrate that the Epochal Sawtooth Effect can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. This paper provides both theoretical insights and quantitative analysis, offering a comprehensive understanding of this ubiquitous phenomenon in modern optimization techniques.
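The controlled quadratic setup described in the abstract can be sketched as follows. This is a hypothetical minimal reproduction, not the authors' code: the pool size, learning rate, and starting point are illustrative choices, and each "sample" is a one-dimensional quadratic $\tfrac{1}{2} a (x - c)^2$ with its own optimum $c$, cycled once per "epoch" with a hand-rolled Adam.

```python
import math
import random

def adam_quadratic_sawtooth(n_problems=50, n_epochs=4, lr=0.1,
                            beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = random.Random(seed)
    # Each "sample" is a quadratic 0.5 * a * (x - c)^2 with its own optimum c.
    problems = [(rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0))
                for _ in range(n_problems)]
    x, m, v, t = 3.0, 0.0, 0.0, 0        # start away from the optima
    losses = []
    for _ in range(n_epochs):
        rng.shuffle(problems)            # reshuffle each epoch, as in practice
        for a, c in problems:            # one pass over the pool = one epoch
            t += 1
            losses.append(0.5 * a * (x - c) ** 2)
            g = a * (x - c)              # gradient of the current quadratic
            m = beta1 * m + (1 - beta1) * g       # first-moment estimate
            v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
            m_hat = m / (1 - beta1 ** t)          # bias corrections
            v_hat = v / (1 - beta2 ** t)
            x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return losses
```

Plotting `losses` against the iteration index, with vertical lines at epoch boundaries, is how one would look for the epoch-aligned sawtooth; rerunning with different `beta2` values (e.g. 0.999 versus 0.9) is the kind of ablation the abstract describes for probing the pattern's amplitude and shape.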