RankArena: A Unified Platform for Evaluating Retrieval, Reranking and RAG with Human and LLM Feedback

📅 2025-08-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current RAG and reranking systems lack scalable, user-centric, multi-perspective evaluation tools. To address this, the paper proposes the first unified platform for end-to-end joint evaluation of retrieval, reranking, and RAG. The platform integrates dual feedback mechanisms, human expert annotation and LLM-as-a-judge, and supports pairwise comparison, full-list labeling, blind voting, visualised ranking, and structured metadata collection. It enables fine-grained relevance annotation and question-answering quality analysis, producing reusable, structured evaluation datasets that directly support downstream tasks such as reranker optimisation and reward modeling. All code is open-sourced and an online demo is provided. The authors report gains in evaluation reliability, interpretability, and practical usability.
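The blind-voting mode described above collects pairwise preferences from human or LLM judges. A minimal sketch of what such a vote record and its aggregation might look like (all field and ranker names here are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class PairwiseVote:
    """One blind pairwise comparison between two rankers for a query."""
    query: str
    ranker_a: str
    ranker_b: str
    winner: str   # "a", "b", or "tie"
    judge: str    # "human" or "llm"

def win_rates(votes):
    """Aggregate blind votes into per-ranker win counts; ties are skipped."""
    wins = Counter()
    for v in votes:
        if v.winner == "a":
            wins[v.ranker_a] += 1
        elif v.winner == "b":
            wins[v.ranker_b] += 1
    return wins

votes = [
    PairwiseVote("q1", "bm25", "monoT5", "b", "human"),
    PairwiseVote("q1", "bm25", "monoT5", "b", "llm"),
    PairwiseVote("q2", "bm25", "monoT5", "a", "human"),
]
print(win_rates(votes))  # Counter({'monoT5': 2, 'bm25': 1})
```

Because each record keeps the judge type, human and LLM verdicts can be aggregated separately and compared.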

📝 Abstract
Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform for comparing and analysing the performance of retrieval pipelines, rerankers, and RAG systems using structured human and LLM-based feedback as well as for collecting such feedback. RankArena supports multiple evaluation modes: direct reranking visualisation, blind pairwise comparisons with human or LLM voting, supervised manual document annotation, and end-to-end RAG answer quality assessment. It captures fine-grained relevance feedback through both pairwise preferences and full-list annotations, along with auxiliary metadata such as movement metrics, annotation time, and quality ratings. The platform also integrates LLM-as-a-judge evaluation, enabling comparison between model-generated rankings and human ground truth annotations. All interactions are stored as structured evaluation datasets that can be used to train rerankers, reward models, judgment agents, or retrieval strategy selectors. Our platform is publicly available at https://rankarena.ngrok.io/, and the Demo video is provided https://youtu.be/jIYAP4PaSSI.
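The abstract notes that LLM-as-a-judge rankings are compared against human ground truth annotations. One standard way to score such agreement is Kendall's tau over the two orderings, sketched below in plain Python (function name and document ids are illustrative):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau correlation between two rankings of the same items.
    1.0 means identical order, -1.0 means fully reversed."""
    pos_a = {d: i for i, d in enumerate(rank_a)}
    pos_b = {d: i for i, d in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

human = ["d3", "d1", "d2", "d4"]   # human ground truth ordering
model = ["d3", "d2", "d1", "d4"]   # LLM judge's ordering
print(kendall_tau(human, model))   # 0.666... (one swapped pair out of six)
```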
Problem

Research questions and friction points this paper is trying to address.

Lack of scalable tools for RAG and reranking evaluation
Need for a unified platform for multi-perspective feedback collection
Need to compare human and LLM-based ranking judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified platform for retrieval and RAG evaluation
Combines human and LLM feedback for analysis
Stores interactions as structured training datasets
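The last point, turning stored interactions into training data, can be illustrated with a small sketch: a full-list human annotation (graded relevance per document) is expanded into (query, preferred, rejected) pairs usable for pairwise reranker or reward-model training. The function and grading scheme are assumptions, not the platform's actual export format:

```python
def preference_pairs(query, annotated_docs):
    """Expand a graded annotation list into (query, preferred, rejected) pairs.
    annotated_docs: list of (doc_id, relevance_grade); higher grade = more relevant.
    Documents with equal grades produce no pair."""
    pairs = []
    for i, (doc_i, rel_i) in enumerate(annotated_docs):
        for doc_j, rel_j in annotated_docs[i + 1:]:
            if rel_i > rel_j:
                pairs.append((query, doc_i, doc_j))
            elif rel_j > rel_i:
                pairs.append((query, doc_j, doc_i))
    return pairs

annots = [("d1", 2), ("d2", 0), ("d3", 1)]
print(preference_pairs("q1", annots))
# [('q1', 'd1', 'd2'), ('q1', 'd1', 'd3'), ('q1', 'd3', 'd2')]
```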