🤖 AI Summary
This study addresses domain adaptation for machine translation into low-resource languages in specialized domains (e.g., medicine), under a realistic setting where the only available resources are a small religious parallel corpus (the Bible), a bilingual dictionary, and monolingual target-domain data in the high-resource language. The authors systematically evaluate a set of lightweight techniques drawn from low-resource NMT and domain adaptation, including dictionary-guided lexical substitution, back-translation, and fine-tuning on target-domain data. The simplest method, DALI (Domain Adaptation by Lexicon Induction), which uses the bilingual dictionary to synthesize pseudo-parallel in-domain data via word-for-word translation, proves most effective, outperforming the other tested methods in automatic evaluation. A follow-up human evaluation of DALI shows that notable errors in terminology and adequacy remain, indicating that domain adaptation for low-resource NMT under severe data constraints still requires more careful investigation.
📝 Abstract
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
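The core idea behind the dictionary-based augmentation evaluated here can be illustrated with a minimal sketch: use a bilingual dictionary to translate in-domain monolingual sentences word by word, pairing each sentence with its crude translation to form synthetic in-domain training data. The function names and the toy lexicon below are hypothetical illustrations, not from the paper:

```python
# Toy sketch of dictionary-based pseudo-parallel data generation.
# A real system would add lexicon induction, subword handling, and
# fine-tuning of an NMT model on the synthetic pairs.

def word_for_word_translate(sentence, lexicon):
    """Translate a whitespace-tokenized sentence with a bilingual lexicon.
    Words missing from the lexicon are copied through unchanged, a common
    fallback for names, numbers, and out-of-dictionary terms."""
    return [lexicon.get(tok, tok) for tok in sentence.split()]

def build_pseudo_parallel(monolingual_corpus, lexicon):
    """Pair each in-domain sentence with its word-for-word translation,
    yielding synthetic parallel data for fine-tuning."""
    return [(" ".join(word_for_word_translate(s, lexicon)), s)
            for s in monolingual_corpus]

# Hypothetical English->French medical lexicon and in-domain sentence.
lexicon = {"the": "le", "patient": "patient", "has": "a", "fever": "fièvre"}
corpus = ["the patient has fever"]
print(build_pseudo_parallel(corpus, lexicon))
# [('le patient a fièvre', 'the patient has fever')]
```

The resulting pairs are noisy (no reordering, no morphology), but even this crude signal can inject in-domain terminology that the Bible-only parallel data lacks.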