EuroLLM-9B: Technical Report

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the underrepresentation of European languages, particularly low-resource ones, in existing open large language models (LLMs), this work introduces EuroLLM-9B, an open multilingual LLM trained from scratch that covers all 24 official EU languages plus 11 additional languages. The authors build EuroFilter, an AI-based multilingual data filter, and EuroBlocks-Synthetic, a synthetic multilingual dataset for post-training. The report covers tokenizer design, architectural specifications, pre-training data collection and filtering, and training procedures. On multilingual benchmarks and machine translation tasks, the model performs competitively and is presented as the leading open European-made LLM of its size. The base and instruction-tuned models, the EuroFilter classifier, and the EuroBlocks-Synthetic dataset are released to support open research and adoption.

📝 Abstract

This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.

Problem

Research questions and friction points this paper is trying to address.

Addresses underrepresentation of European languages in open LLMs

Develops a multilingual model covering all 24 official EU languages and 11 additional languages

Enhances language coverage via synthetic data and AI filtering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training a multilingual model from scratch

Creating an AI-based multilingual data filter (EuroFilter)

Developing a synthetic dataset for post-training (EuroBlocks-Synthetic)

🔎 Similar Papers

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

2024-09-30 | arXiv.org | Citations: 8