Language Models "Grok" to Copy

📅 2024-09-14
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates how the ability to copy text from context emerges during large language model (LLM) pretraining, a foundational skill underpinning in-context learning (ICL) and retrieval-augmented generation (RAG). Copying emerges with a "grokking"-like pattern: it appears well after the training loss has plateaued, then rapidly saturates. For the first time, the work identifies three characteristics shared between copying emergence and grokking: (i) a temporal lag relative to loss reduction, (ii) independence from dataset scale, and (iii) progressive formation of induction heads from shallow to deep layers. By monitoring Transformer pretraining dynamics, localizing induction heads, and intervening with regularization, the experiments confirm that grokking-promoting techniques, particularly regularization, accelerate and strengthen the emergence of copying. These findings point toward an interpretable, intervention-aware training paradigm for improving ICL and RAG performance.

📝 Abstract
We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context, a fundamental skill for various LLM applications including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective: Transformer-based language models develop copying abilities in a manner similar to grokking, which refers to sudden generalization on the test set long after the model has fit the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed at which copying ability develops is independent of the number of tokens trained, just as grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrate that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.
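The copying probe the abstract describes (predicting the second occurrence of a repeated context) can be sketched with a rule-based predictor that implements the induction behavior directly. A minimal sketch; the function names and probe parameters below are illustrative, not taken from the paper:

```python
import random

def induction_predict(seq):
    """Rule-based 'induction head' predictor: for each position t, find the
    most recent earlier occurrence of seq[t] and predict the token that
    followed it; predict None when seq[t] has not been seen before."""
    preds = []
    for t in range(len(seq) - 1):
        pred = None
        for j in range(t - 1, -1, -1):
            if seq[j] == seq[t]:
                pred = seq[j + 1]
                break
        preds.append(pred)
    return preds

def copying_accuracy(predict, vocab_size=200, prefix_len=20, seed=0):
    """Repeated-sequence copying probe: sample a random prefix of distinct
    tokens, repeat it, and score next-token accuracy only on targets that
    lie inside the second copy."""
    rng = random.Random(seed)
    prefix = rng.sample(range(vocab_size), prefix_len)  # distinct tokens
    seq = prefix + prefix
    preds = predict(seq)
    targets = range(prefix_len, len(seq) - 1)  # positions in the second copy
    return sum(preds[t] == seq[t + 1] for t in targets) / len(targets)
```

A perfect induction mechanism scores 1.0 on this probe (`copying_accuracy(induction_predict)`), while a model with no induction behavior stays near chance; tracking a score like this across checkpoints is how the late, abrupt saturation of copying can be observed.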
Problem

Research questions and friction points this paper is trying to address.

How do language models acquire the ability to copy text from context?
How does grokking relate to pre-training dynamics?
How do induction heads develop the copying ability?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analysis of Transformer-based language models' pre-training dynamics
A grokking-based account of context copying
Tracking induction-head formation from shallow to deep layers
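Localizing induction heads is commonly done with a prefix-matching style score: how much attention a head places on the token immediately after an earlier occurrence of the current token. A minimal NumPy sketch under that assumption; the scoring function and the idealized attention pattern are illustrative, not the paper's exact diagnostic:

```python
import numpy as np

def induction_score(attn, seq):
    """Fraction of attention mass each query position places on tokens that
    immediately follow an earlier occurrence of its own token, averaged over
    positions where such a target exists. attn is a (T, T) attention matrix."""
    scores = []
    for t in range(1, len(seq)):
        targets = [j + 1 for j in range(t) if seq[j] == seq[t] and j + 1 <= t]
        if targets:
            scores.append(attn[t, targets].sum())
    return float(np.mean(scores)) if scores else 0.0

# Idealized induction head on a repeated sequence: each query attends to the
# token right after the most recent earlier occurrence of its own token.
seq = list(range(10)) * 2
T = len(seq)
attn = np.zeros((T, T))
for t in range(T):
    prev = [j for j in range(t) if seq[j] == seq[t]]
    attn[t, prev[-1] + 1 if prev else 0] = 1.0

score = induction_score(attn, seq)  # 1.0 for this idealized head
```

Scoring every head of a real model this way, checkpoint by checkpoint, is one way to observe the shallow-to-deep formation of induction heads that the paper reports.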