Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small-scale language models (as in the BabyLM setting) suffer from fixed masking strategies and weak morphological generalization in masked language modeling (MLM) pretraining. Method: This paper proposes a dynamic masking mechanism—adjusting each token's masking probability based on how difficult the model finds it to predict—together with subword-level embedding enhancement to explicitly model morphological variation. Contribution/Results: The approach improves modeling of both lexical morphology and contextual semantics without compromising training efficiency. Evaluated within the BabyLM Challenge framework, it consistently outperforms standard MLM baselines across multiple (Super)GLUE tasks. Notably, it achieves substantial gains in the strictly constrained strict-small track, suggesting that adaptive masking and fine-grained morphological representation are important for language understanding in resource-limited models.
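The adaptive-masking idea described above can be illustrated with a minimal sketch. The paper's exact weighting scheme is not reproduced here; the function names, the loss-proportional heuristic, and the rate bounds below are all illustrative assumptions:

```python
import random

def adaptive_mask_probs(token_losses, base_rate=0.15, min_rate=0.05, max_rate=0.30):
    """Scale each token's masking probability by its relative prediction
    difficulty (proxied here by its last recorded loss), keeping the mean
    rate near base_rate. Illustrative heuristic, not the paper's formula."""
    mean_loss = sum(token_losses) / len(token_losses)
    probs = [base_rate * (loss / mean_loss) for loss in token_losses]
    # Clamp so no token is masked too rarely or too often.
    return [min(max(p, min_rate), max_rate) for p in probs]

def apply_mask(tokens, probs, mask_token="[MASK]", seed=0):
    """Mask each token independently with its own probability."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t
            for t, p in zip(tokens, probs)]
```

Tokens the model already predicts well (low loss) are masked less often, so training signal concentrates on the tokens it still struggles with.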

📝 Abstract
We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is an improved form of Masked Language Modeling (MLM), which adapts the masking probabilities of tokens according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
Problem

Research questions and friction points this paper is trying to address.

Optimizing masked token probabilities for improved language modeling
Enhancing morphological generalization through sub-token embeddings
Boosting performance on (Super)GLUE benchmarks over standard MLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts masking probabilities based on prediction difficulty
Incorporates sub-token embeddings for morphological generalization
Optimizes masked language modeling for pretraining efficiency
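The sub-token embedding idea in the list above can be sketched as a FastText-style composition, where a word's embedding is the mean of its character n-gram embeddings. The paper's actual subword scheme is not specified here; the class, n-gram ranges, and dimensions below are illustrative assumptions:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word with boundary markers (FastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

class SubwordEmbedder:
    """Token embedding = mean of its character n-gram embeddings, so
    morphologically related words (walk/walked/walking) share parameters.
    Illustrative sketch; vectors are random and untrained."""
    def __init__(self, dim=16, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def _vec(self, gram):
        # Lazily allocate one vector per n-gram (cached, so lookups repeat).
        if gram not in self.table:
            self.table[gram] = self.rng.normal(size=self.dim)
        return self.table[gram]

    def embed(self, word):
        return np.mean([self._vec(g) for g in char_ngrams(word)], axis=0)
```

Because inflected forms overlap heavily in their n-grams, rare or unseen inflections inherit representation from their stems, which is the intuition behind the morphological-generalization gains reported for the sub-token embeddings.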