XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

📅 2026-05-28
📈 Citations: 0
Influential: 0
📄 PDF

career value

155K/year
🤖 AI Summary
This work addresses the significant performance disparities exhibited by current large language models when executing identical tasks across different languages. To systematically evaluate this issue, the authors propose a cross-lingual algorithmic task benchmark that ensures linguistic parity, scalability, quantifiability, and transparency. The benchmark leverages templated generation of multilingual synthetic data and employs objective correctness metrics for evaluation. Experimental results demonstrate that state-of-the-art large language models consistently display pronounced cross-lingual performance gaps, thereby validating the effectiveness and necessity of the proposed benchmark in uncovering linguistic capability disparities inherent in these models.
📝 Abstract
We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
Problem

Research questions and friction points this paper is trying to address.

cross-lingual gaps
large language models
algorithmic tasks
benchmark
multilingual evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual evaluation
algorithmic tasks
synthetic benchmark
language model gaps
scalable testing