InspectorRAGet: An Introspection Platform for RAG Evaluation

📅 2024-04-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Despite growing recognition that RAG systems need rigorous evaluation, existing tooling rarely goes beyond generating model output and computing automatic metrics, offering little support for diagnosing where and why quality breaks down. To address this gap, the paper introduces InspectorRAGet, an introspection platform for comprehensive analysis of RAG system output. The platform combines human and algorithmic metrics with annotator-quality analysis, and supports inspection at both the aggregate and the individual-instance level. It is designed for multiple use cases and is publicly available to the community, with a live instance hosted at https://ibm.biz/InspectorRAGet.

📝 Abstract
Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic calculation of metrics. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive RAG evaluation tools beyond basic metrics
Need for analyzing RAG performance at aggregate and instance levels
Combining human and algorithmic metrics for RAG quality assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspection platform for RAG evaluation
Combines human and algorithmic metrics
Analyzes aggregate and instance-level performance
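The combination described above can be sketched in a few lines. This is an illustrative example only, not InspectorRAGet's actual API or data model: the field names (`auto_faithfulness`, `human`) and the ratings are hypothetical, and annotator quality is approximated here with Cohen's kappa between two raters.

```python
# Illustrative sketch (not InspectorRAGet's actual API): combining an
# instance-level algorithmic metric with human ratings, and measuring
# annotator agreement via Cohen's kappa.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators rating the same instances."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

# Hypothetical per-instance data: one automatic score, two human ratings.
instances = [
    {"id": 1, "auto_faithfulness": 0.91, "human": ["good", "good"]},
    {"id": 2, "auto_faithfulness": 0.34, "human": ["good", "bad"]},
    {"id": 3, "auto_faithfulness": 0.78, "human": ["good", "good"]},
    {"id": 4, "auto_faithfulness": 0.12, "human": ["bad", "bad"]},
]

# Aggregate view: mean of the automatic metric over all instances.
aggregate = sum(i["auto_faithfulness"] for i in instances) / len(instances)

# Annotator-quality view: agreement between the two human raters.
kappa = cohens_kappa(
    [i["human"][0] for i in instances],
    [i["human"][1] for i in instances],
)

# Instance-level view: flag cases where the automatic metric and the
# first human rating disagree, for manual inspection.
flagged = [
    i["id"] for i in instances
    if (i["auto_faithfulness"] > 0.5) != (i["human"][0] == "good")
]
```

The point of the sketch is the three complementary views: a single aggregate number hides exactly the disagreement cases (here, instance 2) that instance-level inspection and annotator-agreement statistics surface.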