Model-Based Quality Assessment for Massively Multilingual Parallel Data

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two problems common in large-scale multilingual parallel corpora, non-parallel sentence pairs and low-quality translations, along with the absence of a unified, direction-aware evaluation framework. The authors decouple quality assessment into two independent tasks: parallelism detection with multilingual embeddings and reference-free quality estimation (QE). Experiments on FLORES-200 and BOUQuET evaluate four embedding models and nine QE evaluators, covering 6,654 directions for parallelism and 41,412 ordered directions for QE. The results show that no single model is reliable across all directions, that performance varies substantially by translation direction, that naive QE ensembles dilute strong model signals, and that documented target-language coverage is strongly associated with higher QE scores. The authors conclude that assessment is best treated as a direction-aware routing and calibration problem rather than a search for one universal metric.
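
To make the contrast between a naive ensemble and direction-aware routing concrete, here is a minimal sketch. It is not the authors' implementation: the evaluator names (`qe_a`, `qe_b`, `qe_c`), the score values, and the helper functions are all hypothetical, and the scores are assumed to be mean QE scores on known-good translations that have already been put on a comparable scale.

```python
from statistics import mean

# Hypothetical mean QE scores on known-good translations, keyed by
# (source, target) direction. All numbers are invented for illustration.
scores = {
    ("eng_Latn", "fra_Latn"): {"qe_a": 0.82, "qe_b": 0.61, "qe_c": 0.58},
    ("eng_Latn", "swh_Latn"): {"qe_a": 0.35, "qe_b": 0.74, "qe_c": 0.40},
    ("swh_Latn", "eng_Latn"): {"qe_a": 0.66, "qe_b": 0.52, "qe_c": 0.71},
}

def build_routing_table(scores):
    """Pick, for each direction, the evaluator with the highest score."""
    return {direction: max(per_eval, key=per_eval.get)
            for direction, per_eval in scores.items()}

def naive_ensemble(scores):
    """Average all evaluators per direction, ignoring which ones are strong."""
    return {direction: mean(per_eval.values())
            for direction, per_eval in scores.items()}

routing = build_routing_table(scores)
ensemble = naive_ensemble(scores)
routed = {d: scores[d][routing[d]] for d in scores}

for direction in scores:
    src, tgt = direction
    print(f"{src} -> {tgt}: route to {routing[direction]} "
          f"(routed {routed[direction]:.2f}, ensemble {ensemble[direction]:.2f})")
```

In this toy table a different evaluator wins in each direction, and the per-direction average always sits at or below the best evaluator's score, which is the dilution effect the summary describes. In practice, raw scores from different evaluators are often not on the same scale, so a calibration step would be needed before comparing them this way.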

📝 Abstract

Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source-target directions. Results show that no model is universally reliable across translation directions. Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.
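
The retrieval-style parallelism benchmark mentioned above can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: it assumes sentence embeddings are already computed, that row i on the source side is aligned with row i on the target side, and that nearest-neighbour search by cosine similarity is the retrieval criterion. The toy vectors and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target is the aligned one.

    Assumes src_vecs[i] and tgt_vecs[i] embed the same sentence in two languages.
    """
    hits = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += best == i
    return hits / len(src_vecs)

# Toy stand-ins for embeddings from a multilingual encoder (values invented).
src_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
tgt_vecs = [[0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]

aligned = retrieval_accuracy(src_vecs, tgt_vecs)
misaligned = retrieval_accuracy(src_vecs, tgt_vecs[::-1])
print(f"aligned: {aligned:.2f}, misaligned: {misaligned:.2f}")
```

Running the same function once per source-target direction and per embedding model would yield the kind of direction-by-model table needed to see where a given model is reliable and where it is not.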

Problem

Research questions and friction points this paper is trying to address.

multilingual parallel data

quality assessment

non-parallel sentence pairs

low-quality translations

direction-aware evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-based quality assessment

multilingual parallelism

reference-free QE

direction-aware routing

multilingual embeddings

🔎 Similar Papers

Selected Languages are All You Need for Cross-lingual Truthfulness Transfer

2024-06-20 | International Conference on Computational Linguistics | Citations: 2