🤖 AI Summary
Existing open biomedical corpora are inadequate for large language model training due to limits in scale, quality, and representation of knowledge structure. To address this, we propose the first MeSH ontology-guided, multi-agent collaborative corpus distillation framework. It uses knowledge-driven agent specialization (literature extraction, question-answer generation, and self-assessment) to fully automate the construction of AI-ready biomedical corpora, ensuring knowledge consistency while minimizing human annotation effort. This work provides the first empirical validation of multi-agent systems for scientific corpus construction, demonstrating both effectiveness and interpretability. Fine-tuning Llama3-70B on the distilled corpus yields state-of-the-art performance on biomedical QA tasks, outperforming GPT-4 (with MedPrompt) and Med-PaLM 2, and significantly surpassing existing open-source life-science foundation models.
📝 Abstract
The rapid progress of large language models (LLMs) in biomedical research has underscored the limitations of existing open-source annotated scientific corpora, which are often insufficient in both quantity and quality. Addressing the challenge posed by the complex hierarchy of biomedical knowledge, we propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored to LLM training in the biomedical domain. Central to our approach is a collaborative multi-agent architecture in which specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. These agents collectively generate and refine domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements on biomedical question-answering tasks, outperforming both strong life-science LLM baselines and advanced proprietary models. Notably, our AI-ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM 2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
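The abstract describes three cooperating roles: a MeSH-guided extraction agent, a question-answer generation agent, and a self-assessment agent that filters the output. The paper's implementation is not reproduced here; the following is a minimal sketch of that loop under stated assumptions. All names, the tiny MeSH fragment, and the length-based quality heuristic are illustrative stand-ins for the LLM-backed agents the framework actually uses.

```python
from dataclasses import dataclass

# Illustrative slice of the MeSH hierarchy used to guide extraction
# (the real framework traverses the full ontology).
MESH_FRAGMENT = {"Neoplasms": ["Carcinoma", "Sarcoma"]}

@dataclass
class QAPair:
    question: str
    answer: str
    mesh_term: str
    score: float = 0.0

def extract_passages(literature, mesh_terms):
    """Extraction agent (stub): keep passages mentioning a target MeSH term."""
    return [(t, p) for p in literature for t in mesh_terms if t.lower() in p.lower()]

def generate_qa(term, passage):
    """QA-generation agent (stub): turn a passage into a grounded QA pair."""
    return QAPair(question=f"What does the literature say about {term}?",
                  answer=passage, mesh_term=term)

def self_assess(pair, min_len=20):
    """Self-assessment agent (stub): score a pair; a crude length heuristic
    stands in for LLM-based consistency checking."""
    pair.score = 1.0 if len(pair.answer) >= min_len else 0.0
    return pair

def distill(literature, mesh=MESH_FRAGMENT, threshold=0.5):
    """One pass of the distillation loop: extract, generate, self-assess, filter."""
    terms = [t for parent, kids in mesh.items() for t in (parent, *kids)]
    pairs = [self_assess(generate_qa(t, p))
             for t, p in extract_passages(literature, terms)]
    return [p for p in pairs if p.score >= threshold]

corpus = distill([
    "Carcinoma is a malignancy arising from epithelial cells.",
    "Short sarcoma note.",  # fails the quality check and is discarded
])
```

The key design point the sketch preserves is that the ontology drives extraction (only MeSH-anchored passages enter the pipeline) and that acceptance into the corpus is gated by the assessment agent rather than by human review.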