🤖 AI Summary
This work addresses a safety risk in biomedical retrieval-augmented generation (RAG) systems: long-form responses often contain claims that are unsupported by, or even contradict, the retrieved evidence. The authors propose MedRAGChecker, a framework for claim-level verification and diagnosis of RAG outputs. It decomposes a generated answer into atomic claims and estimates support for each by combining evidence-grounded natural language inference (NLI) with consistency checks against biomedical knowledge graphs. For scalable evaluation, the pipeline is distilled into compact biomedical models and uses an ensemble verifier with class-specific reliability weighting. On four biomedical question-answering benchmarks, MedRAGChecker reliably flags unsupported and contradicted claims. Its answer-level diagnostics help separate retrieval failures from generation failures and reveal distinct risk profiles across generators, particularly on safety-critical biomedical relations.
📝 Abstract
Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics, including faithfulness, under-evidence, contradiction, and safety-critical error rates, that help disentangle retrieval and generation failures. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.
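To make the last two steps of the abstract concrete, the following is a minimal sketch of how class-specific reliability weighting and claim-to-answer aggregation might look. It is not the authors' implementation. The label set (`entail`, `neutral`, `contradict`), the weighted-sum combination rule, the verifier names, and all function and metric names are assumptions chosen for illustration; the abstract does not specify any of them.

```python
from collections import Counter
from typing import Dict, List

# Assumed three-way label set; the abstract does not name the classes.
LABELS = ("entail", "neutral", "contradict")


def ensemble_label(
    votes: Dict[str, Dict[str, float]],
    reliability: Dict[str, Dict[str, float]],
) -> str:
    """Combine per-verifier class probabilities into one claim label.

    votes[verifier][label] is that verifier's probability for the label.
    reliability[verifier][label] is a hypothetical per-class weight, e.g.
    how often the verifier is right when it predicts that class.
    """
    scores = {label: 0.0 for label in LABELS}
    for verifier, probs in votes.items():
        for label in LABELS:
            scores[label] += reliability[verifier][label] * probs[label]
    # Pick the label with the highest reliability-weighted score.
    return max(LABELS, key=lambda label: scores[label])


def answer_diagnostics(claim_labels: List[str]) -> Dict[str, float]:
    """Aggregate claim-level labels into answer-level rates."""
    n = len(claim_labels)
    if n == 0:
        return {"faithfulness": 0.0, "under_evidence": 0.0, "contradiction": 0.0}
    counts = Counter(claim_labels)
    return {
        "faithfulness": counts["entail"] / n,       # share of supported claims
        "under_evidence": counts["neutral"] / n,    # share lacking evidence
        "contradiction": counts["contradict"] / n,  # share contradicted
    }


# Hypothetical example: an NLI verifier and a KG verifier disagree on a claim.
votes = {
    "nli": {"entail": 0.6, "neutral": 0.3, "contradict": 0.1},
    "kg": {"entail": 0.2, "neutral": 0.2, "contradict": 0.6},
}
reliability = {
    "nli": {"entail": 0.9, "neutral": 0.8, "contradict": 0.5},
    "kg": {"entail": 0.4, "neutral": 0.3, "contradict": 0.9},
}
claim_label = ensemble_label(votes, reliability)
report = answer_diagnostics(["entail", "entail", "neutral", "contradict"])
```

Weighting per class rather than per verifier lets a verifier that is trusted for one class (here, the hypothetical KG signal on contradictions) count for more on that class without dominating the others. A safety-critical error rate would additionally need a per-claim flag marking safety-critical relations, which this sketch omits.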