Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses catastrophic forgetting in multilingual continual pretraining, where parameter drift degrades model performance on previously acquired languages and general knowledge. The study establishes, for the first time, a systematic connection between parameter alignment and forgetting mitigation. It introduces five layer-aware alignment strategies, spanning hard layer freezing, soft regularization, post-hoc weight reversion, and model merging, and provides a task-oriented guideline for strategy selection. Evaluation across 32 training languages from five language families, plus held-out languages, shows that the proposed approaches substantially reduce forgetting with minimal interference on new language acquisition: layer freezing and regularization best preserve reading comprehension, whereas post-hoc reversion yields the greatest gains in translation performance.

📝 Abstract

While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
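To make the four named strategy types concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract gives no formulas, so the specific forms used here (Euclidean drift per layer, a squared-distance penalty toward the base weights, linear interpolation for merging) are common textbook choices assumed for illustration. The `Params` type, the function names, and the `strength` and `alpha` parameters are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical representation: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def l2_drift(base: Params, tuned: Params) -> Dict[str, float]:
    """Per-layer Euclidean distance between base and tuned weights."""
    return {
        name: sum((t - b) ** 2 for b, t in zip(base[name], tuned[name])) ** 0.5
        for name in base
    }


def freeze_step(params: Params, grads: Params, frozen: Set[str], lr: float) -> Params:
    """Hard layer freezing: one SGD step that leaves frozen layers untouched."""
    return {
        name: list(w) if name in frozen
        else [wi - lr * gi for wi, gi in zip(w, grads[name])]
        for name, w in params.items()
    }


def anchor_penalty(base: Params, current: Params, strength: Dict[str, float]) -> float:
    """Soft regularization (assumed form): per-layer weighted squared
    distance to the base weights, to be added to the training loss."""
    return sum(
        strength.get(name, 0.0)
        * sum((c - b) ** 2 for b, c in zip(base[name], current[name]))
        for name in base
    )


def revert_layers(base: Params, tuned: Params, layers: Set[str]) -> Params:
    """Post-hoc reversion: copy base weights back into the selected layers."""
    return {
        name: list(base[name]) if name in layers else list(w)
        for name, w in tuned.items()
    }


def merge(base: Params, tuned: Params, alpha: float) -> Params:
    """Model merging (assumed form): (1 - alpha) * base + alpha * tuned."""
    return {
        name: [(1 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }
```

Freezing and the penalty act during training, while reversion and merging are applied to a finished checkpoint. In each function the layer-aware aspect is the per-layer choice of `frozen`, `strength`, or `layers`; which layers the paper selects for each strategy is not stated in the abstract.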

Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting

multilingual language models

continual pretraining

parameter drift

language acquisition

Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter alignment

catastrophic forgetting

continual pretraining