MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation

📅 2026-01-10

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a safety risk in biomedical retrieval-augmented generation (RAG): long-form responses often contain claims that are unsupported by, or contradict, the retrieved evidence. The authors propose MedRAGChecker, a framework for claim-level verification and diagnosis of biomedical RAG outputs. It decomposes a generated answer into atomic claims and estimates support for each claim by combining evidence-grounded natural language inference (NLI) with consistency checks against biomedical knowledge graphs (KGs). For scalable evaluation, the pipeline is distilled into compact biomedical models and paired with an ensemble verifier that uses class-specific reliability weighting. On four biomedical question-answering benchmarks, MedRAGChecker flags unsupported and contradicted claims, helps separate retrieval failures from generation failures, and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.
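As a rough illustration of reliability-weighted ensemble scoring, the sketch below combines per-verifier class probabilities using one weight per verifier per class. The label names, the verifier names (`nli`, `kg`), the weight values, and the normalized weighted-average rule are all assumptions made for illustration; the summary does not specify the formula MedRAGChecker actually uses.

```python
from typing import Dict

# Hypothetical label set; the paper's exact class names may differ.
LABELS = ("supported", "contradicted", "unsupported")


def ensemble_scores(
    verifier_probs: Dict[str, Dict[str, float]],
    reliability: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Weighted average of class probabilities, with one weight per
    (verifier, class) pair. This is an assumed combination rule."""
    scores = {}
    for label in LABELS:
        weighted = sum(
            reliability[name][label] * probs[label]
            for name, probs in verifier_probs.items()
        )
        total = sum(reliability[name][label] for name in verifier_probs)
        scores[label] = weighted / total if total > 0 else 0.0
    return scores


def ensemble_label(scores: Dict[str, float]) -> str:
    """Pick the class with the highest combined score."""
    return max(scores, key=scores.get)
```

Under this scheme a verifier that is trusted more for one class (for example, a KG check for contradictions) can outvote another verifier on that class without dominating the others.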

📝 Abstract

Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics that help disentangle retrieval and generation failures, including faithfulness, under-evidence, contradiction, and safety-critical error rates. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.
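To make the step from claim decisions to answer-level diagnostics concrete, here is a minimal sketch that turns per-claim labels into the four rates named in the abstract. The `Claim` type, the label strings, and the rate definitions (each computed as a fraction of all claims in the answer) are hypothetical; the abstract names the diagnostics but does not define them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Claim:
    """One atomic claim with its verification label (hypothetical schema)."""
    text: str
    label: str  # assumed values: "supported", "unsupported", "contradicted"
    safety_critical: bool = False


def answer_diagnostics(claims: List[Claim]) -> Dict[str, float]:
    """Aggregate claim labels into answer-level rates (assumed definitions)."""
    names = ("faithfulness", "under_evidence", "contradiction",
             "safety_critical_error")
    n = len(claims)
    if n == 0:
        # No claims extracted: report zeros rather than dividing by zero.
        return {name: 0.0 for name in names}

    def rate(pred: Callable[[Claim], bool]) -> float:
        return sum(1 for c in claims if pred(c)) / n

    return {
        "faithfulness": rate(lambda c: c.label == "supported"),
        "under_evidence": rate(lambda c: c.label == "unsupported"),
        "contradiction": rate(lambda c: c.label == "contradicted"),
        # Assumed: a safety-critical claim that is not supported is an error.
        "safety_critical_error": rate(
            lambda c: c.safety_critical and c.label != "supported"
        ),
    }
```

Read this way, a high under-evidence rate points toward retrieval (the evidence needed to judge the claim was not fetched), while a high contradiction rate points toward generation (the answer disagrees with evidence that was retrieved).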

Problem

Research questions and friction points this paper is trying to address.

biomedical retrieval-augmented generation

claim-level verification

unsupported claims

contradictory claims

safety-critical errors

Innovation

Methods, ideas, or system contributions that make the work stand out.

claim-level verification

biomedical RAG

knowledge graph consistency