MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing RAG evaluation relies either on heuristic metrics that require human-annotated ground truth or on costly LLM-based judges, compromising efficiency, scalability, and multilingual support. Method: We introduce the first synthetic RAG evaluation arena covering 18 languages and propose a lightweight surrogate judge model. It takes heuristic features (e.g., ROUGE, F1) as input and is trained via supervised learning under the Bradley–Terry pairwise-comparison framework to replace expensive LLM judges. Contribution/Results: Evaluated on a synthetic multilingual QA benchmark built from Wikipedia corpora, the surrogate judge achieves high rank correlation with LLM judges (Kendall's τ = 0.909), substantially improving evaluation efficiency and reproducibility. Experiments across 19 multilingual large language models show that proprietary and large open-source models currently lead. All code and datasets are publicly released.
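The summary above describes training a surrogate judge on heuristic features under the Bradley–Terry pairwise-comparison framework. A minimal pure-Python sketch of that idea follows; the system names, feature values, and preference pairs are illustrative placeholders, not data from the paper, and the paper's actual training setup may differ:

```python
import math

# Hypothetical heuristic feature vectors (e.g. [ROUGE-L, token-F1]) for
# three RAG systems' answers -- illustrative numbers only.
features = {
    "system_a": [0.62, 0.70],
    "system_b": [0.45, 0.50],
    "system_c": [0.30, 0.35],
}

# Pairwise preferences a teacher LLM judge might emit: (winner, loser).
pairs = [("system_a", "system_b"), ("system_b", "system_c"),
         ("system_a", "system_c")]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_bradley_terry(features, pairs, lr=0.5, epochs=200):
    """Fit weights w so that sigmoid(w . (f_winner - f_loser)) -> 1,
    i.e. a logistic Bradley-Terry model over feature differences."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in pairs:
            diff = [a - b for a, b in zip(features[winner], features[loser])]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            grad = 1.0 - p  # gradient of the log-likelihood
            w = [wi + lr * grad * di for wi, di in zip(w, diff)]
    return w

def score(w, f):
    """Scalar quality score; sorting by it yields the leaderboard."""
    return sum(wi * fi for wi, fi in zip(w, f))

w = train_bradley_terry(features, pairs)
ranking = sorted(features, key=lambda s: score(w, features[s]), reverse=True)
```

Once trained, the learned weights turn cheap heuristic features into scalar scores, so new systems can be ranked without invoking the expensive LLM judge again.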

📝 Abstract
Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground truth for reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for a reliable evaluation. We present a simple and efficient technique to combine the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and outputs the LLM-as-a-judge prediction. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia, focused on multilingual answer generation evaluation. It extensively couples heuristic features and LLM-as-a-judge for evaluation. We benchmark 19 multilingual LLMs and observe a high correlation (Kendall's Tau ($\tau$) = 0.909) between our surrogate judge and GPT-4o as a teacher, using the Bradley-Terry framework. Our results show proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our code and datasets are made publicly available here: https://github.com/vectara/mirage-bench.
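The abstract reports agreement between the surrogate judge and the teacher as Kendall's Tau over system rankings. A small self-contained sketch of that statistic (tau-a, assuming no ties; the rankings shown are placeholders, not results from the paper):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items,
    each given as a list ordered best-to-worst (no ties assumed)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    items = list(rank_a)
    n = len(items)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            x, y = items[i], items[j]
            # A pair is concordant when both rankings order it the same way.
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Identical rankings give tau = 1.0; fully reversed rankings give -1.0,
# so 0.909 indicates near-identical leaderboards.
llm_judge_ranking = ["model_x", "model_y", "model_z"]   # hypothetical
surrogate_ranking = ["model_x", "model_y", "model_z"]   # hypothetical
tau = kendall_tau(llm_judge_ranking, surrogate_ranking)
```

In practice one would compare the surrogate judge's leaderboard of the 19 benchmarked LLMs against the teacher's; `scipy.stats.kendalltau` offers a tested implementation with tie handling.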
Problem

Research questions and friction points this paper is trying to address.

Develops MIRAGE-Bench for multilingual RAG evaluation
Combines heuristic metrics with LLM judge efficiently
Evaluates 19 multilingual LLMs on synthetic benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Train surrogate judge with heuristic metrics
Combine heuristic features and LLM judge
Synthetic multilingual RAG benchmark arena