What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

📅 2025-06-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the ubiquitous "loss plateau, then abrupt learning" phenomenon observed when training shallow Transformers on algorithmic tasks. To diagnose its underlying causes, we employ attention visualization, geometric analysis of hidden states (via cosine similarity), controlled intervention experiments, and behavioral analysis of large models (Pythia, OLMo) during early pretraining. We identify slow optimization of the attention mechanism as the primary driver of the plateau, show that implicit attention refinement precedes the performance jump, and find that the plateau phase coincides with the coupled emergence of interpretable partial solutions, an output repetition bias, and representational collapse in hidden states. Crucially, we quantitatively characterize the co-evolution of collapse and bias for the first time, and demonstrate that targeted attention interventions effectively modulate both plateau duration and degradation severity. Our findings provide a novel mechanistic explanation for deep learning dynamics and establish an actionable analytical framework for studying training evolution.
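The "geometric analysis of hidden states (via cosine similarity)" mentioned above can be illustrated with a minimal sketch. The function below is a hypothetical collapse metric, not the paper's exact implementation: it averages pairwise cosine similarity over token positions, so a value near 1.0 indicates that hidden states have become nearly parallel (collapsed).

```python
import math
from itertools import combinations

def collapse_score(hidden_states):
    """Mean pairwise cosine similarity across token positions.

    hidden_states: list of per-token hidden-state vectors (lists of floats)
    taken from one layer at one training checkpoint. Values near 1.0 indicate
    representational collapse (near-parallel states); lower values indicate
    more diverse token representations.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(hidden_states, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Tracking this score over training checkpoints would, under the paper's account, show it rising during the plateau and falling as the abrupt performance jump occurs.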

📝 Abstract
Training Transformers on algorithmic tasks frequently exhibits an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena, repetition bias and representation collapse, are not artifacts of toy setups but also manifest in the early pretraining stage of large language models like Pythia and OLMo.
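The repetition bias described in the abstract can be quantified with a simple proxy metric. The helper below is an illustrative sketch (not taken from the paper): it reports the fraction of generated tokens that repeat the immediately preceding token.

```python
def repetition_rate(token_ids):
    """Fraction of tokens that repeat the immediately preceding token.

    token_ids: a list of generated token ids. A value near 1.0 means the
    model is emitting the same token over and over, the degenerate behavior
    associated with the loss plateau; near 0.0 means no adjacent repeats.
    """
    if len(token_ids) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(token_ids, token_ids[1:]) if prev == cur)
    return repeats / (len(token_ids) - 1)
```

Logging this rate alongside the loss would make the co-evolution of repetition bias and the plateau directly visible during training.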
Problem

Research questions and friction points this paper is trying to address.

Understanding abrupt learning dynamics in shallow Transformers
Investigating repetition bias and representation collapse during plateaus
Identifying attention map learning as a key bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnoses abrupt learning via controlled attention interventions in shallow Transformers
Reveals hidden progress in attention configuration during the plateau
Validates repetition bias and representation collapse in large language models (Pythia, OLMo)