AI Summary
To address the research gap in resource-efficient, non-English-centric small multilingual large language models (LLMs), this paper introduces Gamayun, a 1.5B-parameter model. The authors propose a novel two-stage pretraining strategy: (1) a multilingual alignment stage using balanced multilingual corpora and language-aware token weighting; and (2) a transfer-enhancement stage incorporating high-quality English data to boost cross-lingual generalization under extremely low training budgets. Gamayun is trained from scratch, integrating curriculum-based sampling and fine-grained multilingual evaluation. On benchmarks spanning 12 languages, it consistently outperforms LLaMA3.2-1B and Qwen2.5-1.5B; achieves state-of-the-art performance on the Russian MERA benchmark among models of comparable scale; and matches or exceeds the significantly larger Qwen3 (trained on 36T tokens) on most tasks, despite Gamayun's substantially smaller size and training cost.
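To make the two-stage data schedule concrete, the sketch below shows one plausible way such a mixture could be implemented. It is a minimal illustration, not the authors' released code: the language list, the temperature-based weighting formula, the stage boundaries, and the English enrichment factor are all assumptions introduced here for clarity.

```python
# Illustrative sketch (assumptions, not the paper's actual pipeline): a possible
# data-mixing schedule for two-stage multilingual pretraining with
# language-aware token weighting.
import random

# Hypothetical 12-language set; the paper's exact language list may differ.
LANGS = ["ru", "en", "de", "fr", "es", "zh", "ar", "hi", "pt", "it", "tr", "ja"]

def language_weights(token_counts: dict[str, int], temperature: float = 0.3) -> dict[str, float]:
    """Temperature-scaled sampling weights: a low temperature flattens the raw
    corpus-size distribution toward a balanced multilingual mix (stage 1)."""
    scaled = {lang: count ** temperature for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

def stage_mixture(stage: int, token_counts: dict[str, int]) -> dict[str, float]:
    """Stage 1: balanced multilingual mix for cross-lingual alignment.
    Stage 2: up-weight high-quality English data (enrichment factor assumed)."""
    weights = language_weights(token_counts)
    if stage == 2:
        weights["en"] *= 3.0  # assumed English enrichment factor, for illustration
        total = sum(weights.values())
        weights = {lang: w / total for lang, w in weights.items()}
    return weights

def sample_language(weights: dict[str, float]) -> str:
    """Draw the language of the next training document from the current mixture."""
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

if __name__ == "__main__":
    # Fake per-language token counts, used only to exercise the schedule.
    corpus_sizes = {lang: random.randint(10**8, 10**10) for lang in LANGS}
    for stage in (1, 2):
        mix = stage_mixture(stage, corpus_sizes)
        print(f"stage {stage}: en={mix['en']:.3f}, ru={mix['ru']:.3f}")
```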
Abstract
We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with a special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM and achieves state-of-the-art results in Russian among models of comparable size (1-2B parameters), including on the MERA benchmark.