Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the low-resource Kazakh–Russian code-switching machine translation (CSMT) task by proposing a fully unsupervised end-to-end modeling approach. Confronted with the dual challenges of scarce annotated parallel corpora and pervasive code-mixing, the authors construct the first publicly available Kazakh–Russian code-switching parallel corpus and design a synthetic data generation framework integrating back-translation and large language models (LLMs). The method combines unsupervised neural machine translation (UNMT) with domain-adaptive fine-tuning. The key contribution lies in generating high-quality, code-mixed training data without human annotation, thereby overcoming both the low-resource and code-mixing bottlenecks. Experimental results show that the model achieves 16.48 BLEU on the test set, approaching the performance of an existing commercial system and surpassing it in human evaluation.
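The summary's central idea, generating synthetic code-switched training pairs from monolingual data via back-translation, can be illustrated with a minimal sketch. All names here (`seed_translate`, `make_code_switched`, `SEED_LEXICON`) are hypothetical stand-ins, not the paper's actual pipeline: a toy word map plays the role of a weak seed MT model, and simple lexical replacement approximates the LLM-based code-switch generation.

```python
import random

# Hypothetical seed lexicon: a toy Kazakh -> Russian word map standing in
# for a weak seed translation model (assumption, not the paper's model).
SEED_LEXICON = {"сәлем": "привет", "қала": "город", "үлкен": "большой"}

def seed_translate(sentence):
    """Word-by-word stand-in for a weak seed MT model (assumption)."""
    return " ".join(SEED_LEXICON.get(tok, tok) for tok in sentence.split())

def make_code_switched(sentence, switch_prob=0.5, rng=None):
    """Synthesize a code-switched source by swapping some Kazakh tokens for
    their Russian counterparts -- a simplified lexical-replacement view of
    the LLM-based generation described in the summary."""
    rng = rng or random.Random(0)
    out = []
    for tok in sentence.split():
        if tok in SEED_LEXICON and rng.random() < switch_prob:
            out.append(SEED_LEXICON[tok])  # switch this token to Russian
        else:
            out.append(tok)  # keep the original Kazakh token
    return " ".join(out)

def back_translate_corpus(monolingual_kk):
    """Build synthetic (code-switched source, Russian target) pairs from
    monolingual Kazakh text alone -- no human-labeled data required."""
    pairs = []
    for sent in monolingual_kk:
        source = make_code_switched(sent)
        target = seed_translate(sent)  # back-translation supplies the target
        pairs.append((source, target))
    return pairs

corpus = back_translate_corpus(["сәлем үлкен қала"])
print(corpus)
```

In the actual pipeline, the seed translator and code-switch generator would be neural models bootstrapped without supervision; the structure of the loop, monolingual text in, synthetic parallel pairs out, is the part this sketch captures.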

📝 Abstract
Machine translation for low-resource language pairs is a challenging task, and it becomes considerably harder when speakers code-switch. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and evaluation results, which include a model achieving 16.48 BLEU, almost reaching an existing commercial system and beating it in human evaluation.
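The 16.48 BLEU figure quoted above is a corpus-level geometric mean of n-gram precisions with a brevity penalty. A minimal stdlib sketch of sentence-level BLEU follows; real evaluations use a standard tool such as sacrebleu, and this toy version (with naive smoothing of zero counts) is only meant to show what the metric measures.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Toy BLEU: modified n-gram precisions (n = 1..max_n), geometric mean,
    brevity penalty. Naively smooths zero counts; for illustration only."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # "Modified" precision: clip each n-gram count by the reference count.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zeros
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# An exact match scores 1.0, i.e. 100 BLEU on the usual 0-100 scale.
print(sentence_bleu("a b c d", "a b c d"))  # → 1.0
```

On that scale, 16.48 BLEU indicates modest but usable n-gram overlap with references, which is why the paper complements it with human evaluation.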
Problem

Research questions and friction points this paper is trying to address.

Develop machine translation for low-resource Kazakh-Russian code-switching
Create synthetic data to train without labeled examples
Build first parallel corpus and evaluate against commercial systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation for low-resource translation
First Kazakh-Russian code-switching parallel corpus
BLEU-competitive model without labeled data