🤖 AI Summary
Existing large medical language models predominantly rely on English-centric pretraining and generic knowledge distillation, and suffer from factual inconsistency and limited multilingual support. To address this, we propose a multilingual retrieval-augmented generation (RAG) framework grounded in Wikipedia-based medical knowledge, constructing 500K verifiable, interpretable reasoning traces across English, Italian, and Spanish to ensure cross-lingual medical knowledge alignment and factual consistency. Our method combines translation-based multilingual data augmentation, in-context learning, and supervised fine-tuning, and is trained and evaluated on the MedQA and MedMCQA benchmarks. Experiments show that our 8B-parameter model achieves state-of-the-art performance on both in-domain and cross-domain medical question answering. We publicly release the multilingual reasoning traces, translated datasets, medical Wikipedia corpora, and fine-tuned models, enhancing the transparency, interpretability, and safety of multilingual clinical decision-support systems.
📝 Abstract
Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in factual medical knowledge. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of safer, more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.
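To make the retrieval-augmented trace-generation loop concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than the authors' actual pipeline: the toy three-entry corpus stands in for the Medical-Wikipedia collection, the bag-of-words scorer stands in for a real retriever (e.g. BM25 or dense embeddings), and the prompt that would be sent to an LLM to produce a grounded reasoning trace is simply returned as a string.

```python
# Hypothetical sketch of retrieval-augmented prompt construction for
# medical QA. The corpus, scoring function, and prompt template are all
# illustrative stand-ins, not the paper's implementation.
import math
from collections import Counter

# Toy stand-in for a medical Wikipedia corpus (title -> passage).
CORPUS = {
    "Metformin": "Metformin is a first-line medication for type 2 diabetes.",
    "Aspirin": "Aspirin is used to reduce pain, fever, and inflammation.",
    "Insulin": "Insulin regulates blood glucose levels.",
}

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

def score(query, doc):
    # Bag-of-words overlap with a crude length penalty; a real system
    # would use BM25 or a dense multilingual retriever instead.
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values()) / math.sqrt(len(tokenize(doc)) + 1)

def retrieve(question, k=2):
    ranked = sorted(CORPUS.items(), key=lambda kv: score(question, kv[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

def build_prompt(question, options):
    # Retrieved passages are prepended so the generated reasoning trace
    # can cite verifiable medical facts rather than parametric knowledge.
    passages = [CORPUS[t] for t in retrieve(question)]
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Context:\n{context}\n\n"
            f"Question: {question}\n"
            f"Options: {', '.join(options)}\n"
            "Answer step by step, citing the context.")

prompt = build_prompt(
    "Which drug is first-line for type 2 diabetes?",
    ["Aspirin", "Metformin", "Insulin"],
)
print(prompt)
```

The same prompt-construction step would be repeated per language (English, Italian, Spanish), with the LLM's response stored as a reasoning trace for later use in few-shot prompting or supervised fine-tuning.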