Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses neural machine translation for extremely low-resource Indigenous languages, using Q'eqchi' Mayan as the case study, where parallel corpora are scarce and data sovereignty must be respected. The authors propose a synthetic data generation approach that avoids web crawling: community-provided dictionaries are expanded into a large synthetic corpus, which is then used for parameter-efficient fine-tuning of mT5-base with LoRA. In-domain evaluation yields a BLEU score of 42.02, indicating that the model acquires morphological and word-order structure. On authentic (non-synthetic) test data, however, BLEU falls to 0.59, revealing poor semantic generalization and overfitting to synthetic patterns. The authors conclude that real-world data must be integrated in later stages through curriculum learning. The work positions synthetic bootstrapping as an approach to low-resource NMT that combines ethical data practices with structural linguistic modeling.
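The dictionary-to-corpus step can be pictured with a small template-filling sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the lexicon layout, the single English template, and the function names are hypothetical, and the target-side strings are placeholders rather than real Q'eqchi' words. The target template uses the verb-object-subject order that the abstract attributes to the language.

```python
import itertools
import random

# Hypothetical bilingual lexicon. The target-side strings are placeholders,
# NOT real Q'eqchi' words; a real pipeline would load community-provided entries.
LEXICON = {
    "verbs": {"sees": "V_SEE", "eats": "V_EAT"},
    "nouns": {"dog": "N_DOG", "woman": "N_WOMAN", "maize": "N_MAIZE"},
}


def synthesize_pairs(lexicon, limit=None, seed=0):
    """Fill one SVO source template and one VOS target template per combination."""
    pairs = []
    for verb in sorted(lexicon["verbs"]):
        # Ordered (subject, object) pairs of distinct nouns.
        for subj, obj in itertools.permutations(sorted(lexicon["nouns"]), 2):
            source = f"the {subj} {verb} the {obj}"  # subject-verb-object
            target = " ".join([
                lexicon["verbs"][verb],   # verb first
                lexicon["nouns"][obj],    # then object
                lexicon["nouns"][subj],   # then subject
            ])
            pairs.append((source, target))
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    return pairs[:limit]                # limit=None keeps every pair


def target_shape(target):
    """Reduce a target sentence to its sequence of category prefixes."""
    return tuple(token.split("_")[0] for token in target.split())


pairs = synthesize_pairs(LEXICON)
shapes = {target_shape(target) for _, target in pairs}
print(len(pairs), "pairs,", len(shapes), "distinct target shape(s)")
```

Every generated target shares one structural shape, which is one way to picture how a model trained on such data could score well in-domain yet fail on natural text with freer syntax.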

📝 Abstract

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus and fine-tuned an mT5-base model on it using Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), indicating that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59): the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model overfits to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language and forces organic inputs into rigid learned patterns. Furthermore, an ablation study using a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but that it must be followed by semantic refinement on authentic data via Curriculum Learning.
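The "limited parameter capacity" point can be made concrete with a rough sketch of the LoRA update, in which a frozen weight matrix W is adapted as W + (alpha / r) * B * A and only the low-rank factors A and B are trained. The rank of 8, the scaling value, and the 768 x 768 projection size below are illustrative assumptions; the abstract does not state the adapter configuration, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B are trainable."""
    base = matvec(W, x)               # frozen pretrained projection
    delta = matvec(B, matvec(A, x))   # low-rank correction: d_out x r times r x d_in
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Tiny numeric check with rank 1. B is conventionally initialised to zero,
# so the adapted layer starts out identical to the frozen one.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [1, 2]
untrained = lora_forward(W, A, [[0], [0]], x, alpha=2, rank=1)
trained = lora_forward(W, A, [[1], [0]], x, alpha=2, rank=1)

# Illustrative sizes only (assumed, not taken from the paper).
full = 768 * 768
adapter = lora_trainable_params(768, 768, rank=8)
print(untrained, trained, adapter, full, adapter / full)
```

Under these assumed sizes the adapter trains only about 2% as many parameters as the full projection, which is consistent with the abstract's reading that auxiliary tasks in the multi-task ablation had little spare capacity to share.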

Problem

Research questions and friction points this paper is trying to address.

low-resource NMT

data scarcity

Indigenous languages

data sovereignty

synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Synthesis

Parameter-Efficient Fine-Tuning

LoRA