Learning Dynamics of Meta-Learning in Small Model Pretraining

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and poor interpretability of large language model pretraining, this work introduces meta-learning, specifically first-order MAML, into the pretraining of small language models (LLaMA-style decoder-only architectures), coupled with a novel subset-masked language modeling (Subset-MLM) objective. By monitoring effective rank and attention-head entropy throughout training, the authors uncover a two-phase representation evolution pattern ("differentiation → compression") and characterize layer-wise specialization timelines. Experiments demonstrate that the method reaches the same loss up to 1.6× sooner than standard pretraining and yields significant F1 improvements on multilingual named entity recognition. The authors present this as the first work to enable interpretable, real-time visualization of training dynamics in small models, aiming at more efficient and transparent small-model pretraining.
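The summary's central diagnostic, effective rank, is not defined on this page. A minimal sketch, assuming the standard Roy–Vetterli definition (the exponential of the Shannon entropy of the normalized singular-value distribution); the paper may use a variant:

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a feature matrix: exp of the Shannon entropy of
    the normalized singular-value distribution. Ranges from 1 (rank-1
    collapse) up to min(features.shape) (isotropic spectrum)."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)                      # normalize spectrum
    entropy = -(p * np.log(p + eps)).sum()       # Shannon entropy
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = rng.standard_normal((64, 32))             # well-spread spectrum
rank1 = np.outer(rng.standard_normal(64), rng.standard_normal(32))
print(effective_rank(full))    # close to the ambient dimension (32)
print(effective_rank(rank1))   # close to 1
```

Tracking this quantity per layer over training steps is what would produce the rise-and-fall curves the summary describes.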

📝 Abstract
Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four LLaMA-style decoder-only models (11M–570M params), and evaluate them on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our models (i) reach the same loss up to 1.6× sooner, (ii) improve F1 on multilingual Universal NER under equal compute, and (iii) make the training dynamics easy to read: first the network's representations fan out ("diversify") and later they collapse into a smaller, shared subspace ("compress"). This two-stage shift shows up as a rise and fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.
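The abstract's first-order MAML (FOMAML) loop can be illustrated with a toy sketch. This is a hypothetical reimplementation on linear regression with an MSE loss, not the paper's released code; the task structure, learning rates, and single inner step are illustrative assumptions:

```python
import numpy as np

def fomaml_step(w, tasks, inner_lr=0.1, outer_lr=0.05, inner_steps=1):
    """One first-order MAML meta-update.

    tasks: list of (X_support, y_support, X_query, y_query) tuples.
    FOMAML adapts a copy of w on each task's support set, then applies
    the query-set gradient taken at the adapted parameters directly to
    w, dropping the second-order terms of full MAML."""
    meta_grad = np.zeros_like(w)
    for Xs, ys, Xq, yq in tasks:
        w_fast = w.copy()
        for _ in range(inner_steps):
            # inner loop: MSE gradient step on the support set
            w_fast -= inner_lr * 2 * Xs.T @ (Xs @ w_fast - ys) / len(ys)
        # first-order approximation: query gradient at adapted params
        meta_grad += 2 * Xq.T @ (Xq @ w_fast - yq) / len(yq)
    return w - outer_lr * meta_grad / len(tasks)
```

In the paper's setting the "tasks" would be language-modeling episodes and the model a LLaMA-style transformer; the inner-adapt/outer-update structure is the same.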
Problem

Research questions and friction points this paper is trying to address.

Enhancing small language model pretraining via meta-learning
Improving interpretability of training dynamics in meta-learning
Optimizing multilingual NER performance with efficient meta-adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates first-order MAML with subset-masked LM
Uses meta-learning for faster small model pretraining
Analyzes training dynamics via rank and entropy
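The second diagnostic above, attention-head entropy, can be sketched directly from softmax attention weights. A minimal version, assuming entropy is averaged over query positions per head (the paper's exact aggregation may differ):

```python
import numpy as np

def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-head entropy of attention distributions.

    attn: array of shape (heads, queries, keys) whose last axis sums to 1
    (softmax output). Returns the mean row entropy for each head: high
    values mean diffuse attention, low values mean sharply focused heads."""
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return row_entropy.mean(axis=-1)

# Uniform attention over 8 keys is maximally diffuse (entropy = log 8),
# while one-hot attention is maximally focused (entropy near 0).
uniform = np.full((2, 4, 8), 1 / 8)
one_hot = np.zeros((2, 4, 8))
one_hot[..., 0] = 1.0
```

Plotting this per head and per layer over training steps would yield the rise-and-fall curves that the paper reads as the "diversify → compress" signature.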