🤖 AI Summary
This study addresses the low-resource Kazakh–Russian code-switching machine translation (CSMT) task by proposing a fully unsupervised, end-to-end modeling approach. Confronted with the dual challenges of scarce annotated parallel corpora and pervasive code-mixing, we first construct the first publicly available Kazakh–Russian code-switching parallel corpus and design a synthetic data generation framework that integrates back-translation and large language models (LLMs). Our method combines unsupervised neural machine translation (UNMT) with domain-adaptive fine-tuning. The key contribution lies in generating high-quality, code-mixed training data without human annotation, thereby overcoming both the low-resource and the code-mixing bottlenecks. Experimental results show that our model achieves 16.48 BLEU on the standard test set, approaching the performance of an existing commercial system, and outperforms it in human evaluation.
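To make the back-translation step of the synthetic data pipeline concrete, here is a minimal sketch. It is an illustration only, not the paper's implementation: `translate_tgt_to_src` is a hypothetical stand-in for a reverse (target→source) model that produces synthetic source sentences from monolingual target-side text.

```python
# Minimal back-translation sketch for synthetic parallel data.
# Assumes some reverse translation model is available as a callable;
# the paper's actual models and checkpoints are not specified here.

from typing import Callable, Iterable


def back_translate(
    monolingual_targets: Iterable[str],
    translate_tgt_to_src: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Turn monolingual target-side sentences into synthetic (src, tgt) pairs.

    Each target sentence is translated into the source language with a
    reverse model; the resulting (synthetic source, real target) pair can
    then train the forward source->target system.
    """
    pairs: list[tuple[str, str]] = []
    for tgt in monolingual_targets:
        synthetic_src = translate_tgt_to_src(tgt)  # may be noisy; acceptable
        pairs.append((synthetic_src, tgt))
    return pairs


# Usage with a hypothetical reverse model:
# data = back_translate(ru_sentences, reverse_model.translate)
```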
📝 Abstract
Machine translation for low-resource language pairs is a challenging task. The task becomes even more difficult when speakers use code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and evaluation results, including a model that achieves 16.48 BLEU, almost matching an existing commercial system and surpassing it in human evaluation.