REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the effectiveness of large language models (LLMs) as evaluators for Russian-language generation, revealing a substantial performance gap compared to the English-language setting. To address this gap, the authors introduce REPA, a fine-grained Russian error annotation dataset comprising 1k user queries and 2k LLM-generated responses annotated across ten linguistic error types, together with human-curated pairwise preference labels, including an overall preference. Six generative models are ranked from human preferences using three rating systems, and eight LLM-as-judge evaluators are assessed in zero-shot and few-shot settings, along with an analysis of position and length biases. Results indicate that LLM judges show limited fine-grained discrimination in Russian and that position and length biases hurt judgment consistency, although human- and LLM-based rankings are partially aligned, leaving room for improvement. The study provides a benchmark dataset and methodology for evaluating LLM judges beyond English.
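As a rough illustration of the pairwise LLM-as-a-judge protocol described above, the sketch below shows how a zero-shot judge call for a single error type might be structured. The prompt wording, the `call_llm` stub, and the verdict format are hypothetical placeholders, not the prompts or interface used in the paper.

```python
# Minimal sketch of pairwise LLM-as-a-judge evaluation (hypothetical wording,
# not the paper's actual prompts). `call_llm` is a placeholder for any
# chat-completion backend you plug in.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a judge model and return its raw reply."""
    raise NotImplementedError("plug in your LLM client here")

JUDGE_TEMPLATE = """You are evaluating two Russian-language responses to the same user query.
Error type under consideration: {error_type}

Query:
{query}

Response A:
{response_a}

Response B:
{response_b}

Which response is better with respect to this error type?
Answer with a single letter: A, B, or T (tie)."""

def judge_pair(query: str, response_a: str, response_b: str, error_type: str) -> str:
    """Return the judge's verdict ("A", "B", or "T") for one response pair."""
    prompt = JUDGE_TEMPLATE.format(
        error_type=error_type, query=query,
        response_a=response_a, response_b=response_b,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "T"} else "T"  # fall back to tie on unparsable output
```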

📝 Abstract
Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, in which one LLM evaluates and scores the outputs of another; such judgments often correlate highly with human preferences. However, LLM-as-a-judge has been studied primarily in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair, expressing their preferences across ten specific error types and selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings, and analyze judge behavior, including position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
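The abstract mentions ranking the six generative LLMs with three rating systems built on human pairwise preferences, without detailing them here. As a generic example of one such pairwise rating scheme, the following is a minimal Elo-style sketch; the model names, K-factor, and base rating are illustrative assumptions, not the paper's choices.

```python
# Illustrative Elo-style rating over pairwise preference verdicts.
# A generic example of a pairwise rating system, not necessarily one of the
# three systems used in the paper.
from collections import defaultdict

def elo_ratings(preferences, k=16.0, base=1000.0):
    """preferences: iterable of (model_a, model_b, winner) with winner in {"A", "B", "T"}."""
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in preferences:
        # Expected score of model_a against model_b under the current ratings.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
        score_a = {"A": 1.0, "B": 0.0, "T": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Toy usage with made-up model names and verdicts:
prefs = [("model_x", "model_y", "A"), ("model_y", "model_z", "T"), ("model_x", "model_z", "B")]
print(sorted(elo_ratings(prefs).items(), key=lambda kv: -kv[1]))
```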
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM-as-a-judge framework in Russian
Introduce REPA dataset for Russian error annotation
Assess LLM judge performance and biases in Russian (a bias-probe sketch follows this list)
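One friction point above is judge bias. A minimal probe for position and length bias over collected pairwise verdicts might look like the sketch below; the record fields and toy data are invented for illustration, not the paper's analysis code.

```python
# Minimal probes for position and length bias in pairwise judge verdicts.
# Each record is assumed (hypothetically) to hold the judge's verdict for the
# original ordering ("verdict_ab") and for the same pair with positions swapped ("verdict_ba").

def position_consistency(records):
    """Fraction of pairs where the verdict flips A<->B when the order is swapped,
    as a position-robust judge should behave."""
    records = list(records)
    flipped = {"A": "B", "B": "A", "T": "T"}
    consistent = sum(1 for r in records if flipped[r["verdict_ab"]] == r["verdict_ba"])
    return consistent / max(len(records), 1)

def length_bias(records):
    """Fraction of non-tie verdicts in which the judge preferred the longer response."""
    decided = [r for r in records if r["verdict_ab"] in {"A", "B"}]
    longer_wins = sum(
        1 for r in decided
        if (r["verdict_ab"] == "A") == (len(r["response_a"]) > len(r["response_b"]))
    )
    return longer_wins / max(len(decided), 1)

# Toy usage with made-up verdicts and responses:
toy = [
    {"verdict_ab": "A", "verdict_ba": "B",
     "response_a": "короткий ответ", "response_b": "значительно более длинный ответ"},
    {"verdict_ab": "B", "verdict_ba": "B",
     "response_a": "ответ", "response_b": "другой ответ"},
]
print(position_consistency(toy), length_bias(toy))
```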
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced REPA dataset for Russian error annotation
Ranked LLMs using human and LLM preference systems
Evaluated LLM judges in zero-shot and few-shot settings (see the prompt sketch after this list)
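As a rough illustration of how a few-shot judge prompt differs from the zero-shot call sketched earlier, the snippet below prepends a handful of labeled comparison examples to the judge instruction. The example format and field names are invented placeholders, not the paper's actual demonstrations.

```python
# Sketch of few-shot prompt assembly for a pairwise judge (hypothetical format).
FEWSHOT_HEADER = "Here are examples of pairwise judgments for Russian responses:\n"

def build_fewshot_prompt(demos, query, response_a, response_b, error_type):
    """demos: list of dicts with keys query, response_a, response_b, verdict."""
    parts = [FEWSHOT_HEADER]
    for d in demos:
        # Each demonstration shows a query, two responses, and the preferred one.
        parts.append(
            f"Query: {d['query']}\nResponse A: {d['response_a']}\n"
            f"Response B: {d['response_b']}\nBetter response: {d['verdict']}\n"
        )
    parts.append(
        f"Now judge the following pair for the error type '{error_type}'.\n"
        f"Query: {query}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        "Answer with a single letter: A, B, or T."
    )
    return "\n".join(parts)
```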
Authors
Alexander Pugachev (Higher School of Economics)
Alena Fenogenova (Higher School of Economics, SaluteDevices)
V. Mikhailov (University of Oslo)
Ekaterina Artemova (Toloka.AI, ex-HSE, ex-LMU)
natural language processing · benchmarking · large language models