NSL-MT: Linguistically Informed Negative Samples for Efficient Machine Translation in Low-Resource Languages

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of parallel corpora and frequent grammatical errors in low-resource neural machine translation (NMT), this paper proposes NSL-MT, a linguistically informed negative sampling framework. During training, NSL-MT constructs a negative space by synthetically generating target-language sentences that violate grammatical constraints, and it introduces a severity-weighted negative-sample loss that explicitly penalizes the model for assigning high probability to these invalid outputs. NSL-MT requires no additional annotations and is plug-and-play compatible with standard NMT architectures. Experiments across multiple low-resource language pairs show BLEU gains of 3-12% for strong baselines and 56-89% for weaker ones, along with a fivefold improvement in data efficiency: with only 1,000 parallel sentence pairs, NSL-MT matches baseline models trained on 5,000 pairs. This yields substantially improved grammatical robustness and generalization in few-shot settings.

📝 Abstract
We introduce Negative Space Learning MT (NSL-MT), a training method that teaches models what not to generate by encoding linguistic constraints as severity-weighted penalties in the loss function. NSL-MT augments limited parallel data with synthetically generated violations of target-language grammar, explicitly penalizing the model when it assigns high probability to these linguistically invalid outputs. We demonstrate that NSL-MT delivers improvements across all architectures: 3-12% BLEU gains for well-performing models and 56-89% gains for models lacking decent initial support. Furthermore, NSL-MT provides a 5x data-efficiency multiplier: training with 1,000 examples matches or exceeds normal training with 5,000 examples. NSL-MT thus offers a data-efficient alternative training method for settings where annotated parallel corpora are limited.
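The severity-weighted penalty described in the abstract can be sketched as a simple loss function. This is a minimal illustrative sketch, not the paper's actual formulation: it assumes the standard negative log-likelihood on the valid translation plus a penalty term that grows with the probability mass the model places on each synthetic grammar violation, scaled by that violation's severity weight.

```python
import math

def nsl_loss(p_positive, negatives):
    """Illustrative severity-weighted negative-sample loss (hypothetical
    form; the paper's exact loss may differ).

    p_positive : probability the model assigns to the correct target
    negatives  : list of (p_negative, severity) pairs, where p_negative
                 is the probability assigned to a synthetic grammar
                 violation and severity is a weight in (0, 1]
    """
    # Standard NLL term: reward probability on the valid translation.
    loss = -math.log(p_positive)
    # Penalty term: the more mass the model puts on an invalid
    # sentence, the larger the penalty, scaled by violation severity.
    for p_neg, severity in negatives:
        loss += -severity * math.log(1.0 - p_neg)
    return loss
```

With no negatives the loss reduces to ordinary NLL; shifting probability mass onto a violation raises the loss, and a higher severity weight raises it faster.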
Problem

Research questions and friction points this paper is trying to address.

Improves machine translation efficiency for low-resource languages
Uses linguistic constraints to penalize invalid grammatical outputs
Enables effective training with limited parallel data through negative samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses linguistic constraints as weighted penalties
Generates synthetic grammar violations as negative samples
Improves data efficiency fivefold for low-resource training
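The negative-sample generation mentioned above can be sketched with a toy perturbation. This is a hypothetical example only: the function, its word-order-swap perturbation, and the fixed severity value are assumptions for illustration, whereas the paper's actual violations are derived from language-specific grammatical constraints.

```python
import random

def make_violation(sentence, seed=0):
    """Generate a synthetic grammar violation by swapping two adjacent
    words (hypothetical perturbation; real violations are derived from
    the target language's grammar). Returns (violation, severity)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if len(tokens) < 2:
        # Too short to perturb; no violation, zero severity.
        return sentence, 0.0
    i = rng.randrange(len(tokens) - 1)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    # Assumed fixed severity for word-order violations.
    return " ".join(tokens), 0.5
```

Each generated violation pairs an ungrammatical sentence with a severity weight, which the training loss then uses to penalize the model for placing probability on it.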
Mamadou K. Keita, Rochester Institute of Technology
Christopher Homan, Rochester Institute of Technology, Computer Science
Huy Le, Rochester Institute of Technology