Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multilingual large language model (LLM) pretraining suffers from a lack of principled, fine-grained filtering methods for non-English data, leading to suboptimal data quality and coverage. Method: We propose a transparent, lightweight, and scalable model-based multilingual data selection framework that combines Transformer-based semantic features with FastText-based structural features in a dual-path classifier, coupled with a cross-lingual controllable sampling strategy. The framework covers 20 languages, including low-resource languages and languages written in multiple scripts, balancing linguistic diversity and knowledge density. Contribution/Results: We release a high-quality, refined multilingual pretraining dataset. Experiments show that models trained on as little as 15% of the original token count match baseline MMLU performance, with gains on MMLU, XWinograd, and other multilingual benchmarks. Generalizability is further validated on the FineWeb-2 multilingual subset.

📝 Abstract
Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual LLM pretraining efficiency
Addressing data selection disparity in non-English languages
Improving dataset curation with model-based filtering techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-based filtering for multilingual datasets
Transformer- and FastText-based classifiers
Efficient training with reduced token usage
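The filtering idea above can be illustrated with a minimal FastText-style classifier: a bag of hashed character n-grams scored by a linear model, with documents kept only if their quality score clears a threshold. This is a hypothetical sketch, not the paper's implementation; the hashing scheme, bucket count, and the `filter_corpus` helper are illustrative assumptions.

```python
import hashlib

def ngram_features(text, n=3, buckets=1024):
    """Hash character n-grams into a fixed-size sparse count vector
    (a FastText-style bag-of-n-grams representation)."""
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % buckets
        counts[idx] = counts.get(idx, 0) + 1
    return counts

def score(doc, weights):
    """Linear quality score: dot product of hashed n-gram counts
    and learned per-bucket weights (weights assumed given here)."""
    return sum(weights.get(i, 0.0) * c for i, c in ngram_features(doc).items())

def filter_corpus(docs, weights, threshold=0.0):
    """Keep only documents whose classifier score exceeds the threshold."""
    return [d for d in docs if score(d, weights) > threshold]
```

In practice the weights would come from a classifier trained to distinguish knowledge-rich from low-quality text; the linear bag-of-n-grams form is what keeps such filtering cheap enough to run over a full web crawl.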
Bettina Messmer
EPFL
Machine Learning, Neural Networks

Vinko Sabolcec
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland

Martin Jaggi
EPFL
Machine Learning, Optimization