🤖 AI Summary
This work addresses catastrophic forgetting in multilingual continual pretraining, where parameter drift degrades model performance on previously acquired languages and general knowledge. The study links forgetting to parameter drift and introduces a suite of five layer-aware parameter alignment strategies (spanning hard layer freezing, soft regularization, post-hoc weight reversion, and model merging), together with task-oriented guidelines for choosing among them. Evaluation across 32 training languages from five language families, plus held-out languages, shows that the proposed approaches substantially reduce forgetting at minimal cost to new language acquisition: layer freezing and regularization best preserve reading comprehension, whereas post-hoc reversion yields the strongest translation gains.
📝 Abstract
While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies spanning hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
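The abstract names the strategy families without giving formulas, so the following is only a minimal sketch of the generic, textbook form each family commonly takes. It is not the authors' implementation. The `Params` representation (layer name mapped to a flat list of weights), all function names, the plain SGD update, the squared-L2 penalty, and the linear-interpolation merge are illustrative assumptions.

```python
from typing import Dict, List, Set

# Hypothetical representation: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def drift_penalty(params: Params, base: Params, lam: float) -> float:
    """Soft regularization (assumed form): lam * squared L2 distance to base."""
    return lam * sum(
        (w - w0) ** 2
        for name in params
        for w, w0 in zip(params[name], base[name])
    )


def sgd_step(params: Params, grads: Params, lr: float, frozen: Set[str]) -> Params:
    """Hard freezing: layers named in `frozen` are copied through unchanged."""
    return {
        name: list(ws) if name in frozen
        else [w - lr * g for w, g in zip(ws, grads[name])]
        for name, ws in params.items()
    }


def revert_layers(params: Params, base: Params, layers: Set[str]) -> Params:
    """Post-hoc reversion: after training, reset chosen layers to base weights."""
    return {
        name: list(base[name]) if name in layers else list(ws)
        for name, ws in params.items()
    }


def merge(params: Params, base: Params, alpha: float) -> Params:
    """Merging (assumed linear): alpha * tuned + (1 - alpha) * base per weight."""
    return {
        name: [alpha * w + (1 - alpha) * w0 for w, w0 in zip(ws, base[name])]
        for name, ws in params.items()
    }


# Toy two-layer "model"; the values are arbitrary illustration data.
base = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
grads = {"layer0": [1.0, 1.0], "layer1": [2.0, 2.0]}

tuned = sgd_step(base, grads, lr=0.5, frozen={"layer0"})
penalty = drift_penalty(tuned, base, lam=0.5)
reverted = revert_layers(tuned, base, layers={"layer1"})
merged = merge(tuned, base, alpha=0.5)
```

In this toy setup, freezing acts during training, the penalty would be added to the training loss, and reversion and merging act on the finished checkpoint. Which layers to freeze or revert, and how strongly to regularize or interpolate, are the choices the paper's layer-aware comparison is concerned with.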