RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning

πŸ“… 2025-05-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing RAG evaluation frameworks face two key bottlenecks: high computational overhead from multi-step LLM prompting and difficulty in generating precise, interpretable pointwise reward signals. This paper proposes RAG-Zeval, a framework that formulates faithfulness and correctness assessment as a rule-guided, end-to-end reasoning task. Evaluators are trained with reinforcement learning, yielding compact single-pass models that produce holistic scores with explicit supporting explanations. A ranking-based outcome reward mechanism replaces precise pointwise reward signals with preference judgments, and the ranking references are synthesized from quality-controlled responses with zero human annotation. Experiments show that RAG-Zeval achieves the strongest correlation with human judgments across multiple benchmarks, outperforming baseline evaluators built on LLMs with 10–100 times more parameters, while also offering superior interpretability.

πŸ“ Abstract
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive, sound assessments with detailed explanations in one pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
Problem

Research questions and friction points this paper is trying to address.

Robust evaluation of RAG systems for trustworthiness
Reducing computational cost in LLM-based evaluation frameworks
Improving interpretability and accuracy in response assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end rule-guided reasoning framework
Reinforcement learning for evaluator training
Ranking-based outcome reward mechanism
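The ranking-based outcome reward rewards the evaluator for ordering responses consistently with a synthetic reference ranking, rather than for matching absolute scores. A minimal sketch of one plausible form of such a reward, pairwise preference agreement (the fraction of response pairs the evaluator orders the same way as the reference), is below; the function name and exact formula are illustrative assumptions, not the paper's implementation:

```python
from itertools import combinations

def ranking_reward(predicted_scores, reference_ranking):
    """Illustrative preference-agreement reward (not the paper's exact formula).

    predicted_scores: dict mapping response id -> evaluator score
    reference_ranking: list of response ids, best first (a quality-controlled
    synthetic reference, obtained with zero human annotation)
    Returns the fraction of pairs whose predicted ordering matches the reference.
    """
    pairs = list(combinations(reference_ranking, 2))
    if not pairs:
        return 0.0
    # (a, b) drawn in reference order, so a should receive the higher score
    concordant = sum(
        1 for a, b in pairs if predicted_scores[a] > predicted_scores[b]
    )
    return concordant / len(pairs)

# A perfectly ordered evaluator earns reward 1.0; one inverted pair
# out of three lowers it to 2/3.
print(ranking_reward({"r1": 0.9, "r2": 0.5, "r3": 0.1}, ["r1", "r2", "r3"]))
```

Because the reward depends only on relative order, it sidesteps the need for precise pointwise reward signals during reinforcement learning.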