m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open biomedical corpora are inadequate for large language model training due to limitations in scale, quality, and knowledge-structure representation. To address this, we propose the first MeSH ontology-guided, multi-agent collaborative corpus distillation framework. It employs knowledge-driven agent specialization—literature extraction, question-answer generation, and self-assessment—to enable fully automated construction of AI-ready biomedical corpora, ensuring knowledge consistency while minimizing human annotation effort. This work provides the first empirical validation of multi-agent systems for scientific corpus construction, demonstrating both effectiveness and interpretability. Fine-tuning Llama3-70B on the distilled corpus yields state-of-the-art performance on biomedical QA tasks, outperforming GPT-4 (with MedPrompt) and Med-PaLM-2, and significantly surpassing existing open-source life science foundation models.

📝 Abstract
The rapid progress of large language models (LLMs) in biomedical research has underscored the limitations of existing open-source annotated scientific corpora, which are often insufficient in quantity and quality. Addressing the challenge posed by the complex hierarchy of biomedical knowledge, we propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for LLM training in the biomedical domain. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. These agents collectively generate and refine domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
Problem

Research questions and friction points this paper is trying to address.

Existing open biomedical corpora are too small and uneven in quality for LLM training
The complex hierarchy of biomedical knowledge complicates corpus distillation
Autonomous, high-quality data extraction from scientific literature is needed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework guided by MeSH hierarchy
Autonomous extraction and synthesis of scientific data
Generates domain-specific QA pairs with minimal manual effort
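The agent pipeline above can be sketched in miniature: an extraction step filters passages by a MeSH heading, a generation step turns each passage into a QA pair, and a self-assessment step keeps only substantive, on-topic pairs. This is a hypothetical illustration under simplifying assumptions — the function names, the template-based generator, and the length/keyword acceptance rule are placeholders, not the paper's implementation (which prompts LLM agents at each stage).

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    mesh_term: str  # MeSH heading the pair is anchored to
    question: str
    answer: str

def extract_passages(literature: list[str], mesh_term: str) -> list[str]:
    # Extraction agent (sketch): keep passages mentioning the target MeSH term.
    return [p for p in literature if mesh_term.lower() in p.lower()]

def generate_qa(passage: str, mesh_term: str) -> QAPair:
    # Generation agent (sketch): a fixed template stands in for the
    # LLM-driven question-answer synthesis described in the paper.
    question = f"What does the literature report about {mesh_term}?"
    return QAPair(mesh_term, question, passage)

def self_assess(pair: QAPair, min_len: int = 20) -> bool:
    # Self-assessment agent (sketch): accept only answers that are
    # substantive and stay on the target MeSH topic.
    return len(pair.answer) >= min_len and pair.mesh_term.lower() in pair.answer.lower()

def distill(literature: list[str], mesh_terms: list[str]) -> list[QAPair]:
    # Run the three stages in sequence for every MeSH term of interest.
    corpus: list[QAPair] = []
    for term in mesh_terms:
        for passage in extract_passages(literature, term):
            pair = generate_qa(passage, term)
            if self_assess(pair):
                corpus.append(pair)
    return corpus

docs = [
    "Metformin lowers hepatic glucose production in type 2 Diabetes Mellitus patients.",
    "A short note.",
]
corpus = distill(docs, ["Diabetes Mellitus"])
print(len(corpus))  # → 1 accepted QA pair
```

The point of the sketch is the division of labor: each stage is a separate, replaceable agent, and only pairs that survive self-assessment enter the distilled corpus, which is how the framework keeps quality high with minimal manual review.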
Meng Xiao
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China.
Xunxin Cai
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China.
Chengrui Wang
Alibaba Group
Computer Vision
Yuanchun Zhou
Computer Network Information Center, CAS
Data Mining, Big Data Analysis