🤖 AI Summary
The absence of systematic evaluation benchmarks for Persian text embeddings hinders reproducible and standardized assessment. Method: This paper introduces FaMTEB—the first large-scale, multi-task, Persian-specific benchmark—built upon an extended MTEB framework. It integrates 63 datasets across seven task categories: classification, clustering, pair classification, reranking, retrieval, summary retrieval (a newly proposed task), and semantic textual similarity; it also adds chatbot evaluation datasets to the MTEB ecosystem for the first time. Data sources include translation, synthetic generation, and original Persian content, yielding multiple high-quality, publicly released Persian NLP datasets. Contribution/Results: FaMTEB is fully open-source, providing standardized evaluation code, unified protocols, and a public leaderboard. Experiments evaluate a range of Persian and multilingual embedding models, improving reproducibility and standardization in Persian text representation evaluation.
📝 Abstract
In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets combine existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets have become essential ingredients in chatbot development and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. We also introduce the new task of summary retrieval, which is not among the tasks in standard MTEB. Another contribution of this paper is a substantial number of new Persian NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models across these tasks. This work provides an open-source benchmark with datasets, code, and a public leaderboard.
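To make the evaluation protocol concrete: MTEB-style benchmarks treat every model as an object exposing an `encode()` method that maps texts to vectors, then score the vectors per task (e.g., top-1 accuracy for a toy retrieval task). The sketch below is purely illustrative and is not FaMTEB code; `ToyEmbedder`, its bag-of-characters embedding, and `retrieval_accuracy` are hypothetical stand-ins chosen so the example runs without any model weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class ToyEmbedder:
    """Hypothetical stand-in for an embedding model's encode() interface,
    the method an MTEB-style harness calls on each model under test."""
    def encode(self, texts):
        # Trivial bag-of-characters embedding; a real model returns
        # dense vectors from a trained encoder.
        vocab = "abcdefghijklmnopqrstuvwxyz "
        return [[t.lower().count(ch) for ch in vocab] for t in texts]

def retrieval_accuracy(model, queries, corpus, gold):
    """Top-1 retrieval score: for each query, pick the most similar corpus
    document by cosine similarity and compare against the gold index
    (a simplified version of retrieval-task scoring)."""
    q_emb = model.encode(queries)
    c_emb = model.encode(corpus)
    hits = 0
    for qi, q in enumerate(q_emb):
        best = max(range(len(c_emb)), key=lambda ci: cosine(q, c_emb[ci]))
        hits += int(best == gold[qi])
    return hits / len(queries)

# Toy usage: one query, two candidate documents, gold answer is index 0.
model = ToyEmbedder()
score = retrieval_accuracy(
    model,
    queries=["weather forecast"],
    corpus=["weather forecast today", "stock prices"],
    gold=[0],
)
```

A real harness would iterate this pattern over each of the seven task categories with task-appropriate metrics, which is why a shared `encode()` interface is all a model needs to be benchmarked.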