🤖 AI Summary
Corpus-aware training (CAT) for neural machine translation injects corpus-level metadata (e.g., quality and domain signals) into each training example, letting the model capture data heterogeneity and adapt its inference behavior to corpus preferences. Its drawback is that it relies on a manually predefined group of high-quality data chosen before training, which is error-prone and inefficient. This paper proposes Optimal Corpus-Aware Training (OCAT), which fine-tunes a CAT-pretrained model by freezing most of the model parameters and tuning only a small set of corpus-related parameters, substantially mitigating overfitting and reducing hyperparameter sensitivity. On the WMT23 English→Chinese and English→German translation tasks, OCAT yields +3.6 and +1.8 chrF improvements, respectively, over vanilla training, and is on par with or slightly better than other state-of-the-art fine-tuning techniques.
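To make the "tagging" idea concrete, below is a minimal sketch of how corpus information could be injected into each training example. The tag format, the corpus name, and the `make_cat_example` helper are illustrative assumptions, not the paper's actual preprocessing.

```python
# Illustrative sketch of corpus-aware "tagging" (hypothetical format, not the
# paper's preprocessing): each example is prefixed with a token naming its
# source corpus, so the model can learn per-corpus quality/domain behavior.
def make_cat_example(src: str, tgt: str, corpus_id: str) -> tuple[str, str]:
    # e.g. corpus_id = "europarl" -> source becomes "<europarl> The cat sat ..."
    return f"<{corpus_id}> {src}", tgt

# At inference time, prepending the tag of a preferred (e.g. high-quality)
# corpus switches the model to that corpus's behavior.
tagged_src, tgt = make_cat_example(
    "The cat sat on the mat.", "Die Katze saß auf der Matte.", "europarl"
)
```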
📝 Abstract
Corpus Aware Training (CAT) leverages valuable corpus metadata during training by injecting corpus information into each training example, and has been found effective in the literature, commonly known as the "tagging" approach. Models trained with CAT inherently learn the quality, domain, and nuances between corpora directly from data, and can easily switch to different inference behavior. To achieve the best evaluation results, CAT models pre-define a group of high-quality data before training starts, which can be error-prone and inefficient. In this work, we propose Optimal Corpus Aware Training (OCAT), which fine-tunes a CAT pre-trained model by freezing most of the model parameters and only tuning a small set of corpus-related parameters. We show that OCAT is lightweight, resilient to overfitting, and effective in boosting model accuracy. We use the WMT23 English to Chinese and English to German translation tasks as our test ground and show +3.6 and +1.8 chrF improvements, respectively, over vanilla training. Furthermore, our approach is on par with or slightly better than other state-of-the-art fine-tuning techniques while being less sensitive to hyperparameter settings.
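As a rough illustration of the fine-tuning step, the sketch below freezes all model parameters and updates only the embedding rows of the corpus tag tokens. This is a hedged PyTorch sketch under the assumption that the "corpus-related parameters" are tag-token embeddings; the `get_input_embeddings` accessor, the gradient-masking hook, and `setup_ocat_finetuning` are assumptions made for illustration, not the authors' implementation.

```python
import torch

def setup_ocat_finetuning(model: torch.nn.Module, tag_token_ids: list[int]):
    """Hypothetical OCAT-style setup: freeze everything except the embedding
    rows belonging to the corpus tag tokens."""
    for p in model.parameters():
        p.requires_grad = False

    emb = model.get_input_embeddings()  # assumes a HuggingFace-style accessor
    emb.weight.requires_grad = True

    tag_ids = torch.tensor(tag_token_ids)

    def keep_only_tag_rows(grad: torch.Tensor) -> torch.Tensor:
        # Zero the gradient for every embedding row except the tag tokens,
        # so only the corpus-related parameters are actually updated.
        mask = torch.zeros_like(grad)
        mask[tag_ids] = 1.0
        return grad * mask

    emb.weight.register_hook(keep_only_tag_rows)
    return [emb.weight]  # parameter group to hand to the optimizer
```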