A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether language models exhibit delayed grammatical generalization, akin to “grokking”, during next-token-prediction pre-training. The authors construct proxy-train and proxy-validation splits based on corpus exposure, use the BLiMP minimal-pair benchmark for evaluation, and report grokking-like behavior in this unsupervised setting. For each minimal pair they identify a critical phrase, and they compare pre-training checkpoints before and after generalization. Across five grammatical phenomena, grammatical concept vectors become more predictive of grammatical acceptability after generalization and occupy a higher-dimensional subspace, while attention from the critical token to the relevant context token is concentrated in a small number of heads.

📝 Abstract

Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings with training over many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited data repetition and no explicit train/validation split. To bridge this gap, we propose an exposure-based framework that enables the study of grokking-like dynamics during LLM pre-training. We ground our evaluation in BLiMP minimal pairs, which provide controlled grammatical contrasts. For every BLiMP minimal pair, we identify a critical phrase, the smallest contiguous span that captures the grammatical contrast and the phenomenon-relevant context. Examples whose critical phrase appears in the pre-training window are assigned to the proxy-train split; the remaining examples are assigned to the proxy-validation split. Across five grammatical phenomena, we observe delayed generalization. Analyzing pre-training checkpoints before and after generalization shows that grammatical concept vectors become more predictive of grammatical acceptability and occupy a higher-dimensional subspace after generalization. We also find that attention from the critical token to the relevant context token is concentrated in a small number of heads.
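
The exposure-based split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `MinimalPair` class, its `critical_phrase` field, the `normalize` helper, and the word-level substring matching are all assumptions made for illustration. The abstract does not say how critical phrases are extracted, how the pre-training window is delimited, or how matches are detected, so here the critical phrase is supplied by hand and the window is passed in as a list of plain strings.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """One BLiMP-style minimal pair (hypothetical container, not the BLiMP API)."""
    good: str             # grammatical sentence
    bad: str              # ungrammatical counterpart
    critical_phrase: str  # assumed to be annotated beforehand


def normalize(text: str) -> str:
    """Lowercase and keep only word tokens, joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def split_by_exposure(pairs, window_texts):
    """Put a pair in proxy-train if its critical phrase occurs in the window,
    otherwise in proxy-validation. Matching is exact on whole words, which is
    a simplification chosen for this sketch."""
    # Pad with spaces so that a phrase only matches on word boundaries.
    corpus = [f" {normalize(t)} " for t in window_texts]
    proxy_train, proxy_val = [], []
    for pair in pairs:
        phrase = f" {normalize(pair.critical_phrase)} "
        seen = any(phrase in doc for doc in corpus)
        (proxy_train if seen else proxy_val).append(pair)
    return proxy_train, proxy_val


# Toy data, invented for illustration only.
pairs = [
    MinimalPair("The cats are asleep.", "The cats is asleep.", "cats are"),
    MinimalPair("The authors were late.", "The authors was late.", "authors were"),
]
window_texts = [
    "Two cats are sitting on the porch.",
    "The author was late.",
]

proxy_train, proxy_val = split_by_exposure(pairs, window_texts)
print([p.critical_phrase for p in proxy_train])  # ['cats are']
print([p.critical_phrase for p in proxy_val])    # ['authors were']
```

Under these assumptions, comparing minimal-pair accuracy on the two splits across checkpoints is what would reveal a gap between fitting exposed examples and generalizing to unexposed ones. A real pipeline would need a scalable index over the corpus rather than a linear scan.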

Problem

Research questions and friction points this paper is trying to address.

grokking

language models

pre-training

grammatical generalization

delayed generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

grokking

pre-training

grammatical generalization