On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of how stable the performance of large-scale multilingual text embedding models is across tasks and languages, noting that benchmark conclusions are often sensitive to dataset composition and aggregation method. To this end, the authors propose a meta-research framework based on multi-criteria decision-making ranking schemes and apply it to models on the MTEB benchmark, with an in-depth analysis of five languages and released results for approximately 230 additional languages. The framework introduces two robustness indicators, dataset-composition robustness and ranking-scheme robustness, to support systematic sensitivity assessment of benchmark findings. Results show that large-scale LLM-based models are often robust top performers, with notable exceptions such as retrieval; moreover, only a small subset of models remains consistently strong across tasks, ranking schemes, and dataset subsets.

📝 Abstract

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset composition and performance aggregation method. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changes in dataset composition) and ranking-scheme robustness (sensitivity of rankings to changes in aggregation method). These indicators enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in retrieval tasks), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.
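
To make the two indicators concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes rank agreement is measured with Kendall's tau, uses two simple aggregation schemes (mean score and mean rank) as stand-ins for the multi-criteria decision-making schemes, and approximates changes in dataset composition by leaving one dataset out at a time. The scores, model names, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy scores: model -> score on each of four datasets.
SCORES = {
    "A": [0.99, 0.60, 0.62, 0.61],
    "B": [0.70, 0.69, 0.68, 0.70],
    "C": [0.50, 0.55, 0.52, 0.58],
}


def rank_by_mean_score(scores, datasets):
    """Order models by their average score over the chosen datasets (best first)."""
    means = {m: sum(s[d] for d in datasets) / len(datasets) for m, s in scores.items()}
    return sorted(means, key=lambda m: -means[m])


def rank_by_mean_rank(scores, datasets):
    """Order models by their average per-dataset rank (best first)."""
    total = {m: 0 for m in scores}
    for d in datasets:
        ordered = sorted(scores, key=lambda m: -scores[m][d])
        for position, m in enumerate(ordered, start=1):
            total[m] += position
    return sorted(total, key=lambda m: total[m])


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(order_a)}
    pos_b = {m: i for i, m in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    signed = sum(
        1 if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0 else -1
        for x, y in pairs
    )
    return signed / len(pairs)


def ranking_scheme_robustness(scores):
    """Agreement between the two aggregation schemes on the full dataset set."""
    all_datasets = list(range(len(next(iter(scores.values())))))
    return kendall_tau(
        rank_by_mean_score(scores, all_datasets),
        rank_by_mean_rank(scores, all_datasets),
    )


def dataset_composition_robustness(scores, rank_fn):
    """Mean agreement between the full ranking and each leave-one-out ranking."""
    all_datasets = list(range(len(next(iter(scores.values())))))
    full = rank_fn(scores, all_datasets)
    taus = []
    for dropped in all_datasets:
        subset = [d for d in all_datasets if d != dropped]
        taus.append(kendall_tau(full, rank_fn(scores, subset)))
    return sum(taus) / len(taus)


if __name__ == "__main__":
    print("scheme robustness:", ranking_scheme_robustness(SCORES))
    print("composition robustness:", dataset_composition_robustness(SCORES, rank_by_mean_score))
```

In this toy data, model A tops the mean-score ranking only because of one very high score, while model B wins on three of the four datasets. The two schemes therefore disagree on the top position, and dropping that single dataset also flips the mean-score ranking, which is the kind of sensitivity the two indicators are meant to expose.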

Problem

Research questions and friction points this paper is trying to address.

multilingual text embedding

benchmark robustness

ranking stability

evaluation methodology

cross-task generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual text embeddings

robustness analysis

ranking schemes