XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of large language models (LLMs) on mid- and low-resource languages, caused by the over-representation of high-resource languages such as English in training data, this paper proposes XDoGE, a multilingual data reweighting framework. XDoGE is the first extension of the DoGE algorithm to a multilingual setting, combining domain-aware reweighting, weight estimation via a small proxy model, and continual pretraining. It introduces a dual-path language-balancing strategy that jointly manages data repetition for minor languages and undersampling of dominant ones across six languages: English and Spanish (high-resource), Portuguese and Catalan (mid-resource), and Galician and Basque (low-resource). Evaluated on the IberoBench benchmark, the authors' IberianLLM-7B-Instruct model achieves an average improvement of 12.3%, with particularly notable gains in understanding and generation for low-resource languages such as Basque and Galician. The framework and model are open-sourced, empirically validating the critical role of language-specific weighting in improving LLM fairness and cross-lingual generalization.

📝 Abstract
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
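The abstract describes step (i) as learning language weights with a proxy model inside a DoGE-style reweighting loop. The sketch below illustrates the general shape of such a loop under stated assumptions: the per-language "alignment" scores, learning rate, and all variable names are illustrative placeholders, not the paper's actual quantities (in DoGE-style methods, each domain is scored by how well its gradients align with a generalization objective measured on the proxy model).

```python
import math

# Hedged sketch of a DoGE-style language reweighting loop.
# The alignment scores below are made-up placeholders; the real
# method derives them from proxy-model gradients, not constants.

LANGS = ["en", "es", "pt", "ca", "gl", "eu"]

def update_weights(weights, alignment, lr=0.5):
    """One exponentiated-gradient (mirror descent) step:
    languages whose data helps generalization gain weight."""
    logits = {l: math.log(weights[l]) + lr * alignment[l] for l in LANGS}
    z = sum(math.exp(v) for v in logits.values())
    return {l: math.exp(v) / z for l, v in logits.items()}

# Start from a uniform language distribution.
w = {l: 1.0 / len(LANGS) for l in LANGS}

# Toy scores: suppose low-resource languages align better with the
# generalization objective, so their share grows over the steps.
toy_alignment = {"en": 0.1, "es": 0.2, "pt": 0.4,
                 "ca": 0.5, "gl": 0.8, "eu": 0.9}

for _ in range(20):
    w = update_weights(w, toy_alignment)

print({l: round(p, 3) for l, p in w.items()})
```

Under these toy scores the weights shift toward Galician and Basque while remaining a valid distribution, which is the qualitative behavior the reweighting step is designed to produce.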
Problem

Research questions and friction points this paper is trying to address.

Optimizes language distribution to enhance LLM inclusivity
Mitigates over-reliance on high-resource languages like English
Improves performance for mid- and low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends DoGE to XDoGE for multilingual data reweighting
Rescales data and trains full model with optimized language weights
Applies continual pre-training with XDoGE weights for improved inclusivity
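The second bullet, rescaling data to the learned weights, implies repeating low-resource corpora (more than one epoch) and subsampling high-resource ones (less than one epoch). A minimal sketch of that bookkeeping, assuming made-up corpus sizes, weights, and a token-free document budget that are not taken from the paper:

```python
# Hedged sketch of rescaling a corpus to learned language weights.
# Corpus sizes, weights, and budget are illustrative placeholders.

corpus = {"en": 1000, "eu": 50}    # documents available per language
weights = {"en": 0.6, "eu": 0.4}   # learned target mixture
budget = 500                       # total documents drawn per pass

def rescale(corpus, weights, budget):
    """Turn target weights into per-language sampling plans.
    epochs > 1.0 means repetition; epochs < 1.0 means undersampling."""
    plan = {}
    for lang, n_docs in corpus.items():
        target = round(weights[lang] * budget)
        plan[lang] = {"target": target, "epochs": target / n_docs}
    return plan

plan = rescale(corpus, weights, budget)
print(plan)
# With these toy numbers: "en" is undersampled (300 of 1000 docs,
# 0.3 epochs) while "eu" is repeated (200 draws from 50 docs, 4 epochs),
# mirroring the repetition-vs-undersampling trade-off the paper studies.
```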