🤖 AI Summary
This work addresses the lack of a comprehensive, multilingual evaluation benchmark for retrieval-augmented generation (RAG) systems in the medical domain by introducing MRAG, a bilingual (Chinese–English) medical RAG benchmark. MRAG builds its corpus from Wikipedia and PubMed and is accompanied by an open-source toolkit, MRAG-Toolkit, which enables modular evaluation of individual RAG components, including retrievers, large language models, and prompting strategies. Experimental results demonstrate that RAG significantly enhances answer reliability and reasoning quality on medical tasks, albeit with a slight degradation in readability for long-form responses. Both the MRAG dataset and the MRAG-Toolkit will be fully open-sourced to advance research on and applications of medical RAG systems.
📝 Abstract
While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, the medical domain lacks a comprehensive evaluation benchmark. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, which covers various tasks in English and Chinese and builds a corpus from Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, which facilitates systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; (c) while RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG dataset and toolkit under the CC BY 4.0 license upon acceptance, to facilitate applications in both academia and industry.