🤖 AI Summary
To address the reliance on manual annotation and poor scalability in evaluating Retrieval-Augmented Generation (RAG) systems, this paper proposes AutoNuggetizer, a framework that fully automates the classic TREC nugget-based evaluation paradigm. AutoNuggetizer uses large language models (LLMs) to extract atomic factual units ("nuggets") from reference material and to assess their coverage in RAG system answers via semantic matching. A key element is calibrating the fully automatic approach against variants in which nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Evaluated on the TREC 2024 RAG Track, AutoNuggetizer achieves strong run-level agreement with scores derived from human-based variants (Spearman ρ > 0.9), and agreement is stronger still when individual components, such as nugget assignment, are automated independently. This work points toward an efficient, reliable, and reproducible paradigm for automated RAG evaluation.
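The two-stage pipeline described above (nugget creation followed by nugget assignment) can be pictured with a minimal sketch. The prompts, the support/partial_support/not_support label set, the vital/okay importance flag, and the 0.5 partial-credit weight below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of a two-stage nugget evaluation pipeline in the spirit of
# AutoNuggetizer. Prompts, labels, and weights are assumptions, not the
# paper's exact implementation.
from dataclasses import dataclass
from typing import Callable, List

# Placeholder for whatever LLM client is used (hypothetical callable).
LLM = Callable[[str], str]


@dataclass
class Nugget:
    text: str            # atomic fact a good answer should contain
    vital: bool = True   # assumed vital/okay importance distinction


def create_nuggets(llm: LLM, question: str, reference_passages: List[str]) -> List[Nugget]:
    """Stage 1: ask the LLM to list atomic facts a complete answer must cover."""
    prompt = (
        f"Question: {question}\n"
        "Passages:\n" + "\n".join(reference_passages) +
        "\nList the atomic facts (one per line) that a complete answer must cover."
    )
    return [Nugget(line.strip()) for line in llm(prompt).splitlines() if line.strip()]


def assign_nuggets(llm: LLM, answer: str, nuggets: List[Nugget]) -> List[str]:
    """Stage 2: ask the LLM whether each nugget is supported by the system answer."""
    labels = []
    for nugget in nuggets:
        prompt = (
            f"Answer: {answer}\nFact: {nugget.text}\n"
            "Reply with exactly one of: support, partial_support, not_support."
        )
        labels.append(llm(prompt).strip().lower())
    return labels


def nugget_score(labels: List[str], partial_credit: float = 0.5) -> float:
    """Aggregate assignment labels into a recall-style score per answer.
    The 0.5 weight for partial support is an illustrative assumption."""
    credit = {"support": 1.0, "partial_support": partial_credit, "not_support": 0.0}
    return sum(credit.get(label, 0.0) for label in labels) / max(len(labels), 1)
```

Averaging `nugget_score` over an evaluation's topics would yield a per-run score, which is the granularity at which agreement with human judgments is reported.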
📝 Abstract
Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
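The run-level agreement described in the abstract amounts to correlating per-run scores from the fully automatic pipeline with scores from the human-based variants. Below is a small illustration of that check using Spearman's rank correlation; the run names and score values are made up for demonstration and are not results from the paper.

```python
# Illustrative run-level agreement check: correlate per-run scores from the
# fully automatic pipeline with scores from manual nugget assignment.
# All values below are fabricated for demonstration purposes only.
from scipy.stats import spearmanr

automatic_scores = {"run_a": 0.62, "run_b": 0.48, "run_c": 0.71, "run_d": 0.55}
manual_scores    = {"run_a": 0.58, "run_b": 0.50, "run_c": 0.69, "run_d": 0.52}

runs = sorted(automatic_scores)
rho, p_value = spearmanr(
    [automatic_scores[r] for r in runs],
    [manual_scores[r] for r in runs],
)
print(f"Run-level Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```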