Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

šŸ“… 2025-02-22
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Existing RAG systems show critical robustness gaps when retrieval contains misinformation, yet current evaluation paradigms assume idealized, noise-free retrieval and therefore never test performance against naturally occurring misleading evidence. Method: The paper introduces RAGuard, a fact-checking benchmark explicitly designed to evaluate RAG robustness against natural misinformation. Built from real-world political discussions on Reddit, RAGuard labels retrieved evidence as supporting, misleading, or irrelevant, emulating the biases, contradictions, and falsehoods encountered in practice. Contribution/Results: Experiments show that mainstream RAG systems lose substantial accuracy under misleading retrieval, often falling below their zero-shot baselines, demonstrating severe vulnerability in realistic misinformation scenarios. To the authors' knowledge, RAGuard is the first benchmark to systematically quantify RAG robustness against natural misinformation, providing an empirical foundation for developing more trustworthy RAG systems.

šŸ“ Abstract
Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to handle misleading retrievals and often fail to maintain their own reasoning when exposed to conflicting or selectively framed evidence, making them vulnerable to real-world misinformation. In such real-world retrieval scenarios, misleading and conflicting information is rampant, particularly in the political domain, where evidence is often selectively framed, incomplete, or polarized. Yet existing RAG benchmarks largely assume a clean retrieval setting, where models succeed by accurately retrieving and generating answers from gold-standard documents. This assumption fails to align with real-world conditions, leading to an overestimation of RAG system performance. To bridge this gap, we introduce RAGuard, a fact-checking dataset designed to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our dataset constructs its retrieval corpus from Reddit discussions, capturing naturally occurring misinformation. It categorizes retrieved evidence into three types: supporting, misleading, and irrelevant, providing a realistic and challenging testbed for assessing how well RAG systems navigate these different kinds of retrieved information. Our benchmark experiments reveal that when exposed to misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), highlighting their susceptibility to noisy environments. To the best of our knowledge, RAGuard is the first benchmark to systematically assess RAG robustness against misleading evidence. We expect this benchmark will drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications.
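To make the comparison described in the abstract concrete, the sketch below shows one way to score a RAG pipeline against its zero-shot baseline, broken down by evidence type. This is a minimal illustration only: the function names (`zero_shot_answer`, `rag_answer`) and the record fields are assumptions for the sake of the example, not the authors' released evaluation code or the actual RAGuard schema.

```python
# Minimal sketch of a per-evidence-type evaluation of RAG vs. zero-shot.
# Function names and record fields are illustrative assumptions, not the
# authors' code or the actual RAGuard schema.
from collections import defaultdict

def evaluate(examples, zero_shot_answer, rag_answer):
    """Return accuracy keyed by (setting, evidence_type)."""
    correct = defaultdict(int)
    total = defaultdict(int)

    for ex in examples:
        # Each example is assumed to hold a claim, a gold verdict,
        # retrieved documents, and an evidence-type tag
        # (supporting / misleading / irrelevant).
        gold = ex["verdict"]
        etype = ex["evidence_type"]

        zs_pred = zero_shot_answer(ex["claim"])              # no retrieval
        rag_pred = rag_answer(ex["claim"], ex["documents"])  # with retrieval

        for setting, pred in (("zero_shot", zs_pred), ("rag", rag_pred)):
            total[(setting, etype)] += 1
            correct[(setting, etype)] += int(pred == gold)

    return {key: correct[key] / total[key] for key in total}
```

Comparing the `("rag", "misleading")` entry with `("zero_shot", "misleading")` is the check that corresponds to the paper's headline finding that RAG can fall below its zero-shot baseline.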
Problem

Research questions and friction points this paper is trying to address.

Existing RAG benchmarks assume clean, gold-standard retrieval, overestimating real-world performance.
LLM-powered RAG systems are vulnerable to misleading, conflicting, or selectively framed evidence, especially in the political domain.
There is no benchmark for measuring RAG robustness against naturally occurring misinformation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RAGuard, a fact-checking benchmark for RAG robustness against misleading retrievals
Builds the retrieval corpus from real Reddit political discussions rather than synthetic noise
Labels retrieved evidence as supporting, misleading, or irrelevant and benchmarks RAG systems against zero-shot baselines (see the illustrative record sketch below)
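As a rough picture of the three evidence categories, here is a hedged sketch of what a RAGuard-style example could look like. All field names, types, and the sample claim are hypothetical illustrations; the released dataset may use a different schema entirely.

```python
# Hypothetical sketch of a RAGuard-style record; field names and the sample
# claim are invented for illustration and do not reflect the released dataset.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class RetrievedDoc:
    text: str                                # Reddit-sourced passage
    evidence_type: Literal["supporting", "misleading", "irrelevant"]

@dataclass
class FactCheckExample:
    claim: str                               # political claim to verify
    verdict: Literal["true", "false"]        # gold fact-check label
    documents: List[RetrievedDoc]            # retrieval corpus entries

example = FactCheckExample(
    claim="Candidate X voted against the infrastructure bill.",
    verdict="false",
    documents=[
        RetrievedDoc("Voting records show Candidate X supported the bill.", "supporting"),
        RetrievedDoc("A viral thread claims X blocked the bill in committee.", "misleading"),
        RetrievedDoc("Discussion of an unrelated local election.", "irrelevant"),
    ],
)
```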
Linda Zeng
The Harker School
Artificial Intelligence, Natural Language Processing
Rithwik Gupta
Irvington High School, Fremont, California, USA
Divij Motwani
Palo Alto High School, Palo Alto, California, USA
Diji Yang
University of California, Santa Cruz
Natural Language Processing, Information Retrieval
Yi Zhang
University of California Santa Cruz, Santa Cruz, California, USA