Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

๐Ÿ“… 2025-10-09
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large language models (LLMs) exhibit imbalanced machine translation (MT) performance across language families and domains, and they can amplify societal biases present in their training data, particularly for low-resource languages, thereby compromising translation fairness. To address this, we propose Translation Tangles, a hybrid bias detection framework integrating rule-based heuristics, semantic similarity filtering, and LLM-based validation. We introduce a high-quality, human-annotated dataset for MT fairness evaluation, comprising 1,439 translation-reference pairs across 24 bidirectional language pairs and diverse domains. Furthermore, we design a unified evaluation paradigm that jointly incorporates multi-metric benchmarking, semantic similarity computation, LLM-based verification, and human assessment. All code, data, and evaluation tools are publicly released to foster reproducible research on fair MT.
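The multi-metric benchmarking stage would typically combine several surface- and character-level translation metrics. As an illustration only (not the paper's implementation), here is a minimal, dependency-free sketch of a simplified chrF-style character n-gram F-score, one metric commonly used in such suites:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 4, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, combined as F-beta.
    Real evaluations should use a standard implementation (e.g. sacreBLEU)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 1.0, while unrelated strings score near 0; the paper's actual metric suite and settings may differ.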

๐Ÿ“ Abstract
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify biases present in their training data, posing serious fairness concerns, especially for low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using a suite of evaluation metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
Problem

Research questions and friction points this paper is trying to address.

Evaluating translation quality and fairness gaps in multilingual LLMs
Measuring performance disparities across language families and domains
Detecting and mitigating biases amplified in low-resource language translations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for evaluating translation quality and fairness
Hybrid bias detection pipeline integrating multiple validation methods
High-quality bias-annotated dataset based on human evaluations
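The three-stage hybrid pipeline (rule-based heuristics, then semantic similarity filtering, then LLM-based validation) can be sketched as follows. This is a hypothetical, dependency-free illustration, not the paper's code: the gendered-term lexicon is a toy example, `SequenceMatcher` stands in for embedding-based cosine similarity, and `llm_validate` is a stub for the actual LLM judge.

```python
from difflib import SequenceMatcher

# Toy lexicon for the rule-based stage (illustrative only; the real
# heuristics would cover many bias categories and languages).
GENDERED_TERMS = {"he", "she", "his", "her", "himself", "herself"}

def rule_based_flags(translation: str) -> set[str]:
    """Stage 1: flag potentially gendered tokens via a lexicon heuristic."""
    return {tok for tok in translation.lower().split() if tok in GENDERED_TERMS}

def semantic_similarity(a: str, b: str) -> float:
    """Stage 2 stand-in: cheap string similarity in place of embedding cosine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def llm_validate(translation: str, reference: str, flags: set[str]) -> bool:
    """Stage 3 stub: an LLM judge would confirm whether flagged terms
    introduce bias absent from the reference. Here: crude token check."""
    ref_tokens = set(reference.lower().split())
    return any(tok not in ref_tokens for tok in flags)

def detect_bias(translation: str, reference: str, sim_threshold: float = 0.5) -> bool:
    flags = rule_based_flags(translation)
    if not flags:
        return False  # no heuristic hit; skip the later, costlier stages
    if semantic_similarity(translation, reference) < sim_threshold:
        return False  # too dissimilar to compare; would route to human review
    return llm_validate(translation, reference, flags)
```

The staged design keeps the expensive LLM call last, so only candidates that pass the cheap heuristic and similarity filters are validated, which matches the filtering role the summary describes.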
๐Ÿ”Ž Similar Papers
No similar papers found.