Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

📅 2025-12-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the research gap in resource-efficient, non-English-centric small multilingual large language models (LLMs), this paper introduces Gamayun, a 1.5B-parameter model. We propose a novel two-stage pretraining strategy: (1) a multilingual alignment stage using balanced multilingual corpora and language-aware token weighting; and (2) a transfer-enhancement stage incorporating high-quality English data to boost cross-lingual generalization under extremely low training budgets. Gamayun is trained from scratch, integrating curriculum-based sampling and fine-grained multilingual evaluation. On 12-language benchmarks, it consistently outperforms LLaMA3.2-1B and Qwen2.5-1.5B; achieves state-of-the-art performance on Russian MERA among models of comparable scale; and matches or exceeds the performance of the significantly larger Qwen3 (trained on 36T tokens) across most tasks, despite its substantially smaller size and training cost.
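The summary mentions language-aware token weighting but does not spell out its form. The sketch below shows one plausible reading, assuming a per-language scaling factor applied to each token's cross-entropy loss so that high-resource languages do not dominate the gradient during the multilingual alignment stage. The function name, the language-ID tensor, and the weight values are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of language-aware token weighting (not the paper's exact scheme).
import torch
import torch.nn.functional as F

def language_weighted_loss(logits, targets, lang_ids, lang_weights):
    """
    logits:       (batch, seq_len, vocab) model outputs
    targets:      (batch, seq_len) next-token labels
    lang_ids:     (batch, seq_len) integer language ID per token (assumed available)
    lang_weights: (num_languages,) per-language scaling factors (assumed, e.g. inverse corpus share)
    """
    # Per-token cross-entropy, kept unreduced so it can be reweighted.
    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Look up the weight of the language each token belongs to.
    weights = lang_weights[lang_ids.reshape(-1)]
    # Weighted mean so that no single high-resource language dominates the update.
    return (token_loss * weights).sum() / weights.sum()
```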

๐Ÿ“ Abstract
We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).
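As a rough illustration of the two-stage schedule described in the abstract, the sketch below switches from uniform per-language sampling (multilingual alignment stage) to a mixture that adds a boosted share of high-quality English documents (transfer-enhancement stage). The language subset, the 40% English share, and the stage boundary are placeholder assumptions; the paper's actual curriculum and mixture proportions are not reproduced here.

```python
# Minimal sketch of a two-stage data-sampling schedule, under assumed proportions.
import random

LANGS = ["ru", "en", "de", "fr", "es", "zh"]  # illustrative subset of the 12 supported languages

def sample_document(step, stage1_steps, corpora, en_high_quality, en_share=0.4):
    """Return one training document according to the current pretraining stage.

    corpora:          dict mapping language code -> list of documents
    en_high_quality:  list of curated English documents used in stage 2
    """
    if step >= stage1_steps and random.random() < en_share:
        # Stage 2 (transfer enhancement): boosted share of high-quality English data.
        return random.choice(en_high_quality)
    # Stage 1 (multilingual alignment): languages sampled with equal probability.
    lang = random.choice(LANGS)
    return random.choice(corpora[lang])
```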
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of small non-English-centric multilingual LLMs
Trains cost-efficient 1.5B-parameter model for resource-constrained environments
Enhances cross-lingual performance, especially in Russian, with limited budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage pre-training with multilingual then English enrichment
Supports 12 languages, with a special focus on Russian, while remaining resource-efficient
Outperforms larger models trained on substantially bigger budgets
Authors
Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe
Huawei Noah's Ark Lab