Is Grokking a Computational Glass Relaxation?

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates "grokking", the counterintuitive phenomenon in which generalization emerges long after training accuracy saturates. In contrast to the prevailing first-order phase transition interpretation, the authors frame grokking as a computational glass relaxation: the network rapidly cools into a low-energy, low-entropy, non-equilibrium memorization state, then slowly relaxes toward a more stable, high-entropy generalizing configuration. Using Wang–Landau sampling to map the Boltzmann entropy (density of states) landscape of Transformers on arithmetic tasks, they find no entropy barrier between memorization and generalization and identify a pronounced high-entropy advantage for generalizing solutions. They further propose WanD, a toy optimizer based on Wang–Landau molecular dynamics, which eliminates grokking without any explicit constraints and finds high-norm generalizing solutions, providing counterexamples to theories that attribute grokking solely to weight-norm evolution toward the Goldilocks zone.
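To make the mapping concrete: the paper treats the network parameters as the degrees of freedom of a physical system and the training loss as its energy, so the Boltzmann entropy at a given loss level is the log of the density of states. In standard statistical-mechanics notation (ours, not the paper's):

```latex
E(\theta) = \mathcal{L}_{\mathrm{train}}(\theta), \qquad
g(E) = \int \delta\!\bigl(E - \mathcal{L}_{\mathrm{train}}(\theta)\bigr)\, \mathrm{d}\theta, \qquad
S(E) = k_B \ln g(E)
```

Grokking then reads as relaxation on this landscape: a rapid quench into a low-entropy glassy (memorizing) region, followed by slow relaxation toward high-entropy generalizing configurations.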

📝 Abstract
Understanding neural networks' (NNs') generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after training performance reaches a near-perfect level, offers a unique window into the underlying mechanisms of NN generalizability. Here we propose an interpretation of grokking by framing it as a computational glass relaxation: viewing an NN as a physical system in which the parameters are the degrees of freedom and the training loss is the system energy, we find that the memorization process resembles the rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, while the later generalization resembles a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments with Transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but far more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer, WanD, based on Wang–Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly defined counterexamples to theories attributing grokking solely to weight-norm evolution towards the Goldilocks zone, and also suggests potential new directions for optimizer design.
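The entropy-landscape measurement above rests on Wang–Landau sampling, a flat-histogram Monte Carlo method that estimates the density of states g(E) directly rather than sampling at a fixed temperature. Below is a minimal, self-contained sketch on a toy two-parameter "loss" (NumPy only; the bin range, step size, and `loss_fn` stand-in are illustrative assumptions, not the paper's Transformer setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_fn(theta):
    # Toy stand-in for a training loss: a rugged 2-D surface (assumption,
    # not the paper's arithmetic-task Transformer loss).
    return 0.5 * np.sum(theta**2) + 0.3 * np.sum(np.cos(3.0 * theta))

# Discretize the energy (loss) axis into bins.
n_bins, e_min, e_max = 50, -1.0, 4.0

def bin_of(e):
    return int(np.clip((e - e_min) / (e_max - e_min) * n_bins, 0, n_bins - 1))

log_g = np.zeros(n_bins)   # running estimate of ln g(E), i.e. S(E) up to a constant
hist = np.zeros(n_bins)    # visit histogram used for the flatness check
log_f = 1.0                # modification factor, annealed toward 0

theta = rng.normal(size=2)
b = bin_of(loss_fn(theta))

while log_f > 1e-3:
    for _ in range(20_000):
        prop = theta + 0.2 * rng.normal(size=theta.shape)   # random parameter move
        b_new = bin_of(loss_fn(prop))
        # Accept with prob min(1, g(E)/g(E')): this drives a flat histogram in E.
        if np.log(rng.random()) < log_g[b] - log_g[b_new]:
            theta, b = prop, b_new
        log_g[b] += log_f   # penalize the current bin whether or not we moved
        hist[b] += 1
    visited = hist[hist > 0]
    if visited.min() > 0.8 * visited.mean():   # histogram "flat enough"
        log_f *= 0.5                           # refine: f -> sqrt(f)
        hist[:] = 0

print("entropy landscape ln g(E), shifted:", log_g - log_g.max())
```

In the paper the same idea is binned jointly over training loss and test accuracy, which turns `log_g` into the two-dimensional entropy landscape on which the absence of an entropy barrier is checked.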
Problem

Research questions and friction points this paper is trying to address.

Understanding grokking, the phenomenon in which neural networks abruptly generalize long after fitting the training data
Framing grokking as a computational glass relaxation to study NN generalizability
Challenging existing phase-transition theories by showing there is no entropy barrier in the grokking transition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framing grokking as a computational glass relaxation
Sampling the Boltzmann entropy (density of states) landscape of NNs
Developing the WanD optimizer from Wang–Landau molecular dynamics (see the sketch after this list)
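The page gives only the high-level idea of WanD, so the following is a hedged sketch of what a Wang–Landau-style optimizer could look like: plain gradient descent whose step is rescaled by an on-line estimate of the entropy slope dS/dE at the current loss, so that over-visited (low-entropy, memorizing) loss levels lose their pull and injected noise can carry the parameters toward higher-entropy regions. The class name, binning scheme, and update rule are illustrative assumptions, not the authors' WanD implementation.

```python
import numpy as np

class WanDLikeOptimizer:
    """Hypothetical Wang-Landau-biased gradient step (illustration, not the paper's WanD)."""

    def __init__(self, n_bins=64, e_min=0.0, e_max=5.0,
                 lr=1e-2, log_f=0.05, noise=1e-3, seed=0):
        self.log_g = np.zeros(n_bins)          # on-line estimate of ln g(E) = S(E)
        self.n_bins, self.e_min = n_bins, e_min
        self.de = (e_max - e_min) / n_bins     # energy (loss) bin width
        self.lr, self.log_f, self.noise = lr, log_f, noise
        self.rng = np.random.default_rng(seed)

    def _bin(self, e):
        return int(np.clip((e - self.e_min) / self.de, 0, self.n_bins - 1))

    def step(self, theta, loss, grad):
        b = self._bin(loss)
        self.log_g[b] += self.log_f            # Wang-Landau on-line entropy update
        # Finite-difference slope dS/dE around the current loss level.
        lo, hi = max(b - 1, 0), min(b + 1, self.n_bins - 1)
        ds_de = (self.log_g[hi] - self.log_g[lo]) / ((hi - lo) * self.de)
        # Descending the biased energy E + ln g(E) rescales the raw gradient:
        # in an over-visited low-loss basin ds_de < 0, the pull weakens, and
        # the noise term lets the parameters drift toward less-visited states.
        scale = 1.0 + ds_de
        return theta - self.lr * scale * grad + self.noise * self.rng.normal(size=theta.shape)

# Toy usage on a quadratic "loss" (illustrative only):
opt = WanDLikeOptimizer()
theta = np.ones(4)
for _ in range(1_000):
    loss, grad = 0.5 * float(theta @ theta), theta
    theta = opt.step(theta, loss, grad)
```

What the paper itself reports is the outcome: dynamics of this far-from-equilibrium flavor can eliminate grokking without any norm constraint and still land on high-norm generalizing solutions.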
Authors

Xiaotian Zhang (Department of Physics, City University of Hong Kong, Hong Kong, China)
Yue Shang (Drexel University)
Entao Yang (AI Research Scientist @ Air Liquide; PhD from University of Pennsylvania)
Ge Zhang (Department of Physics, City University of Hong Kong, Hong Kong, China)