AI Summary
To address the research gap in resource-efficient, non-English-centric small multilingual large language models (LLMs), this paper introduces Gamayun, a 1.5B-parameter model. The authors propose a novel two-stage pretraining strategy: (1) a multilingual alignment stage using balanced multilingual corpora and language-aware token weighting; and (2) a transfer-enhancement stage incorporating high-quality English data to boost cross-lingual generalization under extremely low training budgets. Gamayun is trained from scratch, integrating curriculum-based sampling and fine-grained multilingual evaluation. On benchmarks spanning 12 languages, it consistently outperforms LLaMA3.2-1B and Qwen2.5-1.5B; achieves state-of-the-art performance on the Russian MERA benchmark among models of comparable scale; and matches or exceeds the significantly larger Qwen3 (trained on 36T tokens) on most tasks, despite Gamayun's substantially smaller size and training cost.
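To make the two-stage data schedule concrete, the sketch below shows one plausible way such a mixture could be implemented. It is a minimal illustration, not the authors' released code: the language list, the temperature-based weighting formula, the stage boundaries, and the English enrichment factor are all assumptions introduced here for clarity.

```python
# Illustrative sketch (assumptions, not the paper's actual pipeline): a possible
# data-mixing schedule for two-stage multilingual pretraining with
# language-aware token weighting.
import random

# Hypothetical 12-language set; the paper's exact language list may differ.
LANGS = ["ru", "en", "de", "fr", "es", "zh", "ar", "hi", "pt", "it", "tr", "ja"]

def language_weights(token_counts: dict[str, int], temperature: float = 0.3) -> dict[str, float]:
    """Temperature-scaled sampling weights: a low temperature flattens the raw
    corpus-size distribution toward a balanced multilingual mix (stage 1)."""
    scaled = {lang: count ** temperature for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

def stage_mixture(stage: int, token_counts: dict[str, int]) -> dict[str, float]:
    """Stage 1: balanced multilingual mix for cross-lingual alignment.
    Stage 2: up-weight high-quality English data (enrichment factor assumed)."""
    weights = language_weights(token_counts)
    if stage == 2:
        weights["en"] *= 3.0  # assumed English enrichment factor, for illustration
        total = sum(weights.values())
        weights = {lang: w / total for lang, w in weights.items()}
    return weights

def sample_language(weights: dict[str, float]) -> str:
    """Draw the language of the next training document from the current mixture."""
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]

if __name__ == "__main__":
    # Fake per-language token counts, used only to exercise the schedule.
    corpus_sizes = {lang: random.randint(10**8, 10**10) for lang in LANGS}
    for stage in (1, 2):
        mix = stage_mixture(stage, corpus_sizes)
        print(f"stage {stage}: en={mix['en']:.3f}, ru={mix['ru']:.3f}")
```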
Abstract
We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with a special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM and achieves state-of-the-art results in Russian among models of comparable size (1-2B parameters), including on the MERA benchmark.