XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

📅 2026-05-28

🤖 AI Summary

This work addresses the performance disparities that large language models show when performing the same task in different languages. To evaluate the issue systematically, the authors propose a benchmark of cross-lingual algorithmic tasks designed to be commensurate across languages, scalable, quantifiable, and transparent. Tasks are synthetic, generated from templates in multiple languages, and scored with objective correctness metrics. Experiments show that multiple state-of-the-art models display persistent cross-lingual performance gaps on the benchmark.

📝 Abstract

We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
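
The four properties in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the authors' code: the addition task, the four hand-written templates, the `num_digits` complexity knob, and the `model(lang, prompt)` callable are all hypothetical. The sketch builds the same underlying problem in every language from a template, scales difficulty with operand length, scores responses by exact match, and reports the spread in per-language accuracy.

```python
import random
import re

# Hypothetical templates for one toy task (integer addition).
# They are short enough to audit by eye for translation errors.
TEMPLATES = {
    "en": "What is {a} plus {b}?",
    "es": "¿Cuánto es {a} más {b}?",
    "de": "Was ist {a} plus {b}?",
    "fr": "Combien font {a} plus {b} ?",
}


def make_instance(num_digits, rng):
    """Build one problem rendered in every language; num_digits sets difficulty."""
    lo = 10 ** (num_digits - 1)
    hi = 10 ** num_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    prompts = {lang: t.format(a=a, b=b) for lang, t in TEMPLATES.items()}
    return {"prompts": prompts, "answer": a + b}


def is_correct(response, answer):
    """Objective check: the last integer in the response must equal the answer.

    Simplification: digit grouping such as "1,234" is not handled.
    """
    numbers = re.findall(r"-?\d+", response)
    return bool(numbers) and int(numbers[-1]) == answer


def accuracy_by_language(instances, model):
    """Query model(lang, prompt) on every instance and return accuracy per language."""
    hits = {lang: 0 for lang in TEMPLATES}
    for inst in instances:
        for lang, prompt in inst["prompts"].items():
            if is_correct(model(lang, prompt), inst["answer"]):
                hits[lang] += 1
    return {lang: count / len(instances) for lang, count in hits.items()}


def max_gap(accuracy):
    """Largest difference in accuracy between any two languages."""
    return max(accuracy.values()) - min(accuracy.values())


def toy_model(lang, prompt):
    """Stand-in for a real model: always right in English, off by one elsewhere."""
    a, b = map(int, re.findall(r"\d+", prompt))
    return str(a + b) if lang == "en" else str(a + b + 1)


rng = random.Random(0)
instances = [make_instance(num_digits=3, rng=rng) for _ in range(20)]
accuracy = accuracy_by_language(instances, toy_model)
gap = max_gap(accuracy)
```

Because every language sees the same operands, a nonzero `gap` here can only come from how the model handles the prompt language, which is the sense in which differential performance is a sufficient indicator of a cross-lingual gap. A gap of zero on such a task would not rule out gaps on other kinds of tasks.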

Problem

Research questions and friction points this paper is trying to address.

cross-lingual gaps

large language models

algorithmic tasks

benchmark

multilingual evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual evaluation

algorithmic tasks

synthetic benchmark

language model gaps

scalable testing

🔎 Similar Papers

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

2024-06-23 | arXiv.org | Citations: 0

Are LLMs Good Cryptic Crossword Solvers?

2024-03-15 | arXiv.org | Citations: 2
