BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large language models (LLMs) show weak zero-shot generalization on specialized medical natural language understanding (NLU) tasks, such as domain-knowledge-intensive reasoning, fine-grained semantic interpretation, and structured information extraction. Method: We propose a unified prompting framework coupled with cross-task medical instruction tuning. Using BioMistral as the base model, we construct MNLU-Instruct, an instruction-tuning dataset covering seven medical NLU task categories, built from diverse open-source biomedical corpora through structured prompt design, and apply supervised instruction fine-tuning. Contribution/Results: Empirical results show that task diversity yields greater gains than data scaling alone. On six benchmark tasks from BLUE and BLURB, the method achieves superior zero-shot performance over the original BioMistral, ChatGPT, and GPT-4. This demonstrates strong cross-task generalization enabled by unified prompting in medical LLMs, establishing a lightweight, instruction-driven paradigm for domain adaptation.

📝 Abstract
Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora and can generalize to new tasks. However, these instruction-tuned LLMs often perform poorly on specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks, (2) curate an instruction-tuning dataset, MNLU-Instruct, utilizing diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, by fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting, across 6 important NLU tasks, from two widely adopted medical NLU benchmarks: BLUE and BLURB. Our experiments show that BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning over diverse NLU tasks enhance LLMs' generalizability across diverse medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.
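The abstract's "dataset-agnostic prompting strategy" can be illustrated with a small sketch: every task instance, regardless of task type, is rendered into one shared instruction/options/input template, so the model is tuned and queried identically across tasks. The template fields and both example instances below are hypothetical illustrations, not the paper's exact schema.

```python
# Hypothetical unified prompting format: one template for all NLU task types.
TEMPLATE = (
    "### Task: {task}\n"
    "### Instruction: {instruction}\n"
    "### Options: {options}\n"
    "### Input: {text}\n"
    "### Answer:"
)

def build_prompt(task, instruction, options, text):
    """Render one NLU instance into the shared template."""
    return TEMPLATE.format(
        task=task,
        instruction=instruction,
        options=", ".join(options) if options else "free text",
        text=text,
    )

# A named-entity-recognition instance and a relation-classification instance
# share the same surface format, differing only in field contents.
ner_prompt = build_prompt(
    task="named entity recognition",
    instruction="List all disease mentions in the input, separated by commas.",
    options=[],
    text="The patient was diagnosed with type 2 diabetes and hypertension.",
)
rel_prompt = build_prompt(
    task="relation classification",
    instruction="Classify the drug-drug interaction type in the input.",
    options=["advise", "effect", "mechanism", "none"],
    text="Aspirin may increase the anticoagulant effect of warfarin.",
)
```

Because the format is fixed, adding a new task to the training mixture only requires filling in the template fields, which is what makes the strategy dataset-agnostic.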
Problem

Research questions and friction points this paper is trying to address.

General-purpose instruction-tuned LLMs perform poorly on specialized medical NLU tasks.
These tasks require domain knowledge, granular text comprehension, and structured data extraction.
Existing models lack generalizability across diverse medical NLU tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified prompting format covering 7 medical NLU tasks
MNLU-Instruct: an instruction-tuning dataset curated from open-source medical NLU corpora
BioMistral-NLU: BioMistral fine-tuned for generalizable medical NLU
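The ablation finding, that task diversity helps even at a constant training budget, implies a sampling setup like the following sketch: draw a fixed total number of instances, but vary how many distinct task corpora they come from. The corpus names, sizes, and sampling function are illustrative placeholders, not the paper's actual data pipeline.

```python
import random

def sample_mixture(corpora, total, num_tasks, seed=0):
    """Draw `total` instances spread evenly over `num_tasks` task corpora."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(corpora), num_tasks)
    per_task = total // num_tasks
    mixture = []
    for name in chosen:
        pool = corpora[name]
        mixture.extend(rng.sample(pool, min(per_task, len(pool))))
    rng.shuffle(mixture)
    return mixture

# Placeholder corpora, one list of instance IDs per task type.
corpora = {
    "ner": [f"ner-{i}" for i in range(1000)],
    "relation_extraction": [f"re-{i}" for i in range(1000)],
    "classification": [f"cls-{i}" for i in range(1000)],
    "qa": [f"qa-{i}" for i in range(1000)],
}

narrow = sample_mixture(corpora, total=800, num_tasks=1)   # one task type
diverse = sample_mixture(corpora, total=800, num_tasks=4)  # four task types
# Both mixtures contain 800 instances; only their task diversity differs.
```

Comparing models tuned on `narrow` versus `diverse` isolates the effect of task variety from the effect of data volume, which is the comparison the ablation describes.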