🤖 AI Summary
Real-world RAG systems suffer from heterogeneous error types that are difficult to diagnose during deployment. Method: This paper introduces the first production-oriented RAG error taxonomy, categorizing failures along four dimensions—retrieval, generation, alignment, and hallucination—and constructs RAG-ErrorBank, the first large-scale, human-annotated dataset of RAG error types, publicly released. It further designs an automated error detection and evaluation framework strictly aligned with the taxonomy, enabling fine-grained error localization and robustness quantification. Contributions/Results: Experiments demonstrate a +28.6% improvement in error identification accuracy over baselines. The framework provides interpretable, reusable diagnostic pathways for systematic debugging and optimization. All components—including source code, RAG-ErrorBank, and evaluation tools—are fully open-sourced to foster reproducible research and practical deployment.
📝 Abstract
Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.