Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

๐Ÿ“… 2025-10-09
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models (LLMs) exhibit imbalanced machine translation (MT) performance across language families and domains, and they can amplify societal biases present in their training data, particularly for low-resource languages, thereby compromising translation fairness. To address this, we propose Translation Tangles, a hybrid bias detection framework integrating rule-based heuristics, semantic similarity filtering, and LLM-based validation. We introduce a high-quality, human-annotated dataset for MT fairness evaluation, comprising 1,439 translation-reference pairs across 24 bidirectional language pairs and diverse domains. Furthermore, we design a unified evaluation paradigm that jointly incorporates multi-metric benchmarking, semantic similarity computation, LLM-based verification, and human assessment. All code, data, and evaluation tools are publicly released to foster reproducible research on fair MT.
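The multi-metric benchmarking stage would typically combine several surface- and character-level translation metrics. As an illustration only (not the paper's implementation), here is a minimal, dependency-free sketch of a simplified chrF-style character n-gram F-score, one metric commonly used in such suites:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 4, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, combined as F-beta.
    Real evaluations should use a standard implementation (e.g. sacreBLEU)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 1.0, while unrelated strings score near 0; the paper's actual metric suite and settings may differ.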

๐Ÿ“ Abstract
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify biases present in their training data, posing serious fairness concerns, especially for low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using a suite of evaluation metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
Problem

Research questions and friction points this paper is trying to address.

Evaluating translation quality and fairness gaps in multilingual LLMs
Measuring performance disparities across language families and domains
Detecting and mitigating biases amplified in low-resource language translations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for evaluating translation quality and fairness
Hybrid bias detection pipeline integrating multiple validation methods
High-quality bias-annotated dataset based on human evaluations
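The three-stage hybrid pipeline (rule-based heuristics, then semantic similarity filtering, then LLM-based validation) can be sketched as follows. This is a hypothetical, dependency-free illustration, not the paper's code: the gendered-term lexicon is a toy example, `SequenceMatcher` stands in for embedding-based cosine similarity, and `llm_validate` is a stub for the actual LLM judge.

```python
from difflib import SequenceMatcher

# Toy lexicon for the rule-based stage (illustrative only; the real
# heuristics would cover many bias categories and languages).
GENDERED_TERMS = {"he", "she", "his", "her", "himself", "herself"}

def rule_based_flags(translation: str) -> set[str]:
    """Stage 1: flag potentially gendered tokens via a lexicon heuristic."""
    return {tok for tok in translation.lower().split() if tok in GENDERED_TERMS}

def semantic_similarity(a: str, b: str) -> float:
    """Stage 2 stand-in: cheap string similarity in place of embedding cosine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def llm_validate(translation: str, reference: str, flags: set[str]) -> bool:
    """Stage 3 stub: an LLM judge would confirm whether flagged terms
    introduce bias absent from the reference. Here: crude token check."""
    ref_tokens = set(reference.lower().split())
    return any(tok not in ref_tokens for tok in flags)

def detect_bias(translation: str, reference: str, sim_threshold: float = 0.5) -> bool:
    flags = rule_based_flags(translation)
    if not flags:
        return False  # no heuristic hit; skip the later, costlier stages
    if semantic_similarity(translation, reference) < sim_threshold:
        return False  # too dissimilar to compare; would route to human review
    return llm_validate(translation, reference, flags)
```

The staged design keeps the expensive LLM call last, so only candidates that pass the cheap heuristic and similarity filters are validated, which matches the filtering role the summary describes.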
๐Ÿ”Ž Similar Papers
No similar papers found.