🤖 AI Summary
This study addresses the issue of inflated performance in large language models (LLMs) on multilingual machine translation benchmarks due to training data contamination, which can propagate across translation directions and obscure true generalization capabilities. Using FLORES-200 as a diagnostic benchmark, the authors compare the contaminated Bloomz model with the uncontaminated Llama model, revealing, for the first time, a cross-directional memorization mechanism. By applying source-side perturbations, such as paraphrasing and named entity substitution, combined with BLEU score analysis, they quantify the models' reliance on memorized translations. Experimental results demonstrate that Bloomz exhibits artificially inflated performance even on translation directions it never saw in training, while named entity substitution substantially reduces its BLEU scores, confirming the effectiveness of this perturbation as a probe for memorization-driven behavior.
📝 Abstract
Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions through target-side memorization. Further analysis shows that recall of memorized references often persists despite source-side perturbations such as paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
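The named-entity probe described above can be illustrated with a toy sketch. All names here (`swap_entities`, `memorized_translate`, the example sentence pair, and the simple smoothed BLEU) are hypothetical stand-ins, not the paper's actual implementation or data: a "contaminated" model that regurgitates a memorized reference scores perfect BLEU on the clean source, but its score collapses once named entities in the source are swapped, because the memorized output no longer matches the perturbed reference.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the order-n n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-1 smoothing on the n-gram precisions
    (a simplification of corpus BLEU, sufficient for this illustration)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())   # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def swap_entities(text, entity_map):
    """Source-side perturbation: substitute named entities per entity_map."""
    for old, new in entity_map.items():
        text = text.replace(old, new)
    return text

def memorized_translate(source, memorized):
    """Toy 'contaminated' model: recalls the memorized reference when the
    source matches a training sentence; otherwise copies the source."""
    return memorized.get(source, source)

# Hypothetical FLORES-style sentence pair the toy model has memorized.
src = "Paris is the capital of France ."
ref = "Paris est la capitale de la France ."
memorized = {src: ref}

# Clean source: the model regurgitates the reference, so BLEU is perfect.
clean_bleu = sentence_bleu(memorized_translate(src, memorized), ref)

# Entity-swapped source: memorized recall no longer applies, BLEU collapses.
perturbed = swap_entities(src, {"Paris": "Lyon", "France": "Spain"})
perturbed_ref = swap_entities(ref, {"Paris": "Lyon", "la France": "l'Espagne"})
perturbed_bleu = sentence_bleu(memorized_translate(perturbed, memorized),
                               perturbed_ref)

print(f"clean BLEU: {clean_bleu:.2f}, perturbed BLEU: {perturbed_bleu:.2f}")
```

A genuinely generalizing model would degrade far less under the same entity swap, which is why the BLEU gap between clean and perturbed sources serves as a memorization signal.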