🤖 AI Summary
This work addresses the challenge of hallucination in large language models when tackling complex medical question answering, where conventional single-shot retrieval-augmented generation (RAG) struggles to support multi-step reasoning. To this end, the authors propose Self-MedRAG, a self-reflective hybrid RAG framework that emulates the clinical “hypothesis–verification” workflow. The approach integrates BM25 and Contriever retrievers via reciprocal rank fusion (RRF), generates answers grounded in explicit reasoning chains, and incorporates a lightweight self-reflection module—based on either natural language inference (NLI) or a large language model—to iteratively verify and refine responses. Query reformulation is further employed to enhance retrieval quality. Evaluated on MedQA and PubMedQA, the method achieves accuracy rates of 83.33% and 79.82%, respectively, significantly outperforming single-retriever baselines and effectively reducing unsupported answers.
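The RRF step mentioned above is a simple rank-based fusion: each document's fused score sums 1/(k + rank) across the retrievers' ranked lists, with k commonly set to 60. A minimal sketch (the document IDs and retriever outputs below are hypothetical illustrations, not from the paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs via reciprocal rank fusion.

    A document's fused score is sum over lists of 1/(k + rank),
    with rank starting at 1; k=60 is a common default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top hits from a sparse (BM25) and a dense (Contriever) retriever:
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# "d1" ranks first: it appears near the top of both lists.
```

Because RRF uses only ranks, it needs no score normalization between the sparse and dense retrievers, which is what makes it a convenient glue for hybrid retrieval.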
📝 Abstract
Large Language Models (LLMs) have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.
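The retrieve → generate → verify → reformulate loop described in the abstract can be sketched as the control flow below. The component functions (`retrieve`, `generate`, `verify`, `reformulate`) are hypothetical stand-ins: in the actual system they would be the hybrid RRF retriever, an LLM generator producing an answer plus rationale, and an NLI- or LLM-based entailment check.

```python
def self_reflective_rag(question, retrieve, generate, verify, reformulate,
                        max_iters=3):
    """Iterate retrieval and generation; reformulate the query whenever the
    rationale is judged unsupported by the retrieved evidence."""
    query = question
    answer = None
    for _ in range(max_iters):
        evidence = retrieve(query)
        answer, rationale = generate(question, evidence)
        if verify(rationale, evidence):  # e.g. NLI entailment of rationale
            return answer
        query = reformulate(question, rationale)
    return answer  # best effort after the iteration budget is spent

# Toy stand-ins to show the control flow only (not real components):
queries_seen = []
def retrieve(q):
    queries_seen.append(q)
    return ["evidence about aspirin"] if "aspirin" in q else ["irrelevant passage"]
def generate(q, evidence):
    return ("yes", evidence[0])          # (answer, rationale)
def verify(rationale, evidence):
    return "aspirin" in rationale        # stand-in for an NLI check
def reformulate(q, rationale):
    return q + " aspirin"                # stand-in for LLM query rewriting

ans = self_reflective_rag("Does low-dose ASA help?", retrieve, generate,
                          verify, reformulate)
# First pass retrieves an unsupported context, triggering one reformulation.
```

The key design point the abstract highlights is that the loop terminates on evidentiary support, not on generation confidence, which is what suppresses unsupported claims.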