Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies a “two-clock” phenomenon in deep neural network training: the classification loss decays quickly, while the learned representation simplifies on a much slower timescale. For deep linear networks, the authors give a formal characterization of this decoupling between loss minimization and representational simplification. They extend the picture to nonlinear multilayer perceptrons through a conditional ReLU reduction, which applies in regions where the activation patterns on the training set stay fixed. The analysis combines the Schatten-type regularization induced by weight decay, the Kurdyka–Łojasiewicz inequality, and gradient dynamics estimates. Experiments on modular addition illustrate the resulting two-phase mechanism, “fitting first, simplifying later,” which gives a dynamical explanation for grokking that is rigorous in the deep linear case and conditional in the ReLU case.
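One way to write the two timescales down is sketched below. The notation is assumed for illustration and is not taken from the paper: $\mathcal{L}$ is the classification loss, $\mathcal{E}$ is a structural energy measuring how far the representation is from its simplified form, $\mathcal{E}^\star$ is its limiting value, and $\theta_t$ is the parameter vector at training time $t$. The exponent $p$ is a placeholder, since this summary does not state the polynomial rate.

```latex
\begin{align*}
  T_{\mathrm{fit}}(\varepsilon)
    &= \inf\{\, t \ge 0 : \mathcal{L}(\theta_t) \le \varepsilon \,\},
    & T_{\mathrm{fit}}(\varepsilon) &= O\!\left(\log \tfrac{1}{\varepsilon}\right), \\
  T_{\mathrm{simp}}(\varepsilon)
    &= \inf\{\, t \ge 0 : \mathcal{E}(\theta_t) - \mathcal{E}^\star \le \varepsilon \,\},
    & T_{\mathrm{simp}}(\varepsilon) &= O\!\left(\varepsilon^{-p}\right)
      \quad \text{for some } p > 0 .
\end{align*}
```

Under these assumptions the ratio $T_{\mathrm{simp}}(\varepsilon) / T_{\mathrm{fit}}(\varepsilon)$ grows without bound as $\varepsilon \to 0$, which is the sense in which the two clocks separate.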

📝 Abstract

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level ε on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka–Łojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks therefore separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis, while the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.
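The reduction of a ReLU network to a linear model on a fixed activation pattern can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the bias-free two-layer architecture, the random data, the shapes, and the function names (`relu_mlp`, `activation_pattern`, `masked_linear`) are all assumptions made for illustration.

```python
import numpy as np


def relu_mlp(W1, W2, X):
    """Bias-free two-layer ReLU network, f(x) = W2 relu(W1 x), applied row-wise."""
    return np.maximum(X @ W1.T, 0.0) @ W2.T


def activation_pattern(W1, X):
    """Boolean mask of active hidden units: one row per input, one column per unit."""
    return (X @ W1.T) > 0.0


def masked_linear(W1, W2, X, pattern):
    """Model with the activation pattern frozen; linear in W1 for a fixed mask."""
    return ((X @ W1.T) * pattern) @ W2.T


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))    # hypothetical training inputs
W1 = rng.normal(size=(16, 5))  # hidden layer weights
W2 = rng.normal(size=(3, 16))  # classifier head weights

pattern = activation_pattern(W1, X)

# Perturb W1 by an amount too small to flip any pre-activation sign:
# every entry of X @ delta.T is at most half the smallest |pre-activation|.
margin = np.abs(X @ W1.T).min()
delta = rng.normal(size=W1.shape)
delta *= 0.5 * margin / np.abs(X @ delta.T).max()
W1_new = W1 + delta

same_pattern = np.array_equal(activation_pattern(W1_new, X), pattern)
agree = np.allclose(
    relu_mlp(W1_new, W2, X), masked_linear(W1_new, W2, X, pattern)
)
```

Both `same_pattern` and `agree` hold here because the perturbation stays inside the region where the pattern is fixed. Once a weight update flips a pre-activation sign on some training input, the frozen-mask model and the ReLU network can disagree, which is why the reduction is conditional.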

Problem

Research questions and friction points this paper is trying to address.

grokking

training clocks

representation simplification

deep linear networks

ReLU MLPs

Innovation

Methods, ideas, or system contributions that make the work stand out.

two training clocks

deep linear networks

conditional ReLU reduction