SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

📅 2026-06-11

🤖 AI Summary

This work addresses the lack of a systematic embedding benchmark for Slovak, a low-resource West Slavic language, and the poor transfer of existing Slovak-specific models to embedding tasks. The authors introduce SkMTEB, the first MTEB-style benchmark for Slovak, spanning seven task types and 31 datasets, nearly four times the depth of existing multilingual benchmark coverage for Slovak. Starting from Multilingual E5 models, they develop two lightweight, locally deployable models, e5-sk-small (45M parameters) and e5-sk-large (365M parameters), through vocabulary trimming and fine-tuning. Despite size reductions of up to 62%, both models are competitive with proprietary APIs, which makes them suitable for local semantic search and retrieval-augmented generation (RAG), and the approach offers a replicable path for other under-resourced languages.

📝 Abstract

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
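The abstract names vocabulary trimming but does not spell out the procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy vocabulary, a toy embedding table stored as a list of rows, and a target-language corpus that is already tokenized into ids. The function name `trim_vocabulary`, its parameters, and all data are hypothetical.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_token_ids, special_ids, min_count=1):
    """Keep special tokens plus tokens seen at least `min_count` times.

    vocab:            dict mapping token string -> old token id
    embeddings:       list of rows, where embeddings[old_id] is that token's vector
    corpus_token_ids: iterable of documents, each a list of old token ids
    special_ids:      set of old ids that must always be kept (padding, unknown, ...)
    Returns the new vocab, the trimmed embedding rows, and the old->new id map.
    """
    counts = Counter(tid for doc in corpus_token_ids for tid in doc)

    # Sort the kept ids so the relative order of tokens is preserved.
    keep = sorted(
        tid for tid in vocab.values()
        if tid in special_ids or counts[tid] >= min_count
    )

    old_to_new = {old: new for new, old in enumerate(keep)}
    id_to_token = {tid: tok for tok, tid in vocab.items()}

    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}
    new_embeddings = [embeddings[old] for old in keep]  # copy only the kept rows
    return new_vocab, new_embeddings, old_to_new


# Toy data (hypothetical): a 7-token multilingual vocabulary with 2-dimensional vectors.
vocab = {"<pad>": 0, "<unk>": 1, "dobrý": 2, "deň": 3, "hello": 4, "world": 5, "svet": 6}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
slovak_corpus = [[2, 3], [2, 6]]  # token ids observed in Slovak text only

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, slovak_corpus, special_ids={0, 1}
)
reduction = 1 - len(new_vocab) / len(vocab)
print(f"kept {len(new_vocab)} of {len(vocab)} tokens ({reduction:.0%} fewer embedding rows)")
```

In a multilingual model the token embedding table typically accounts for a large share of the parameters, which is why dropping rows for tokens that never occur in the target language can shrink the model substantially. A real pipeline would also have to rebuild the tokenizer so that it emits the new ids, which this sketch leaves out.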

Problem

Research questions and friction points this paper is trying to address.

low-resource language

text embedding benchmark

Slovak

embedding models

local deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Slovak text embedding

low-resource language

vocabulary trimming