🤖 AI Summary
Large language models (LLMs) exhibit weak zero-shot generalization on specialized medical natural language understanding (NLU) tasks—such as domain-knowledge-intensive reasoning, fine-grained semantic interpretation, and structured information extraction.
Method: We propose a unified prompting framework coupled with cross-task medical instruction tuning. Using BioMistral as the base model, we construct MNLU-Instruct, a high-quality instruction dataset covering seven medical NLU task categories, built from diverse open-source biomedical corpora. The model is then adapted through structured prompt design and supervised instruction fine-tuning on this dataset.
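To make the unified prompting idea concrete, here is a minimal sketch of a dataset-agnostic prompt template; the field names and template layout are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical unified prompt template for medical NLU instruction tuning.
# The field names (task description, label options, input text) are
# illustrative assumptions; the paper's exact format may differ.
UNIFIED_TEMPLATE = (
    "Task: {task_description}\n"
    "Options: {label_options}\n"
    "Input: {text}\n"
    "Answer:"
)

def build_prompt(task_description, label_options, text):
    """Render one training or inference instance in the shared format."""
    return UNIFIED_TEMPLATE.format(
        task_description=task_description,
        label_options=", ".join(label_options),
        text=text,
    )

# Example: a relation-extraction instance rendered in the same format a
# classification or NER instance would use.
prompt = build_prompt(
    "Classify the relation between the two drugs mentioned in the sentence.",
    ["advise", "effect", "mechanism", "none"],
    "Concomitant use of drug A may increase the plasma level of drug B.",
)
print(prompt)
```

Because every task is cast into the same task-description/options/input/answer shape, a single fine-tuned model can be queried on unseen datasets without per-dataset prompt engineering.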
Contribution/Results: Empirical results show that task diversity yields greater gains than data scaling alone. On six benchmark tasks drawn from BLUE and BLURB, our method outperforms the original BioMistral, ChatGPT, and GPT-4 in the zero-shot setting, demonstrating strong cross-task generalization enabled by unified prompting and establishing a lightweight, instruction-driven paradigm for medical domain adaptation.
📝 Abstract
Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora and can generalize to new tasks. However, these instruction-tuned LLMs often perform poorly on specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge this gap, we: (1) propose a unified prompting format for 7 important medical NLU tasks; (2) curate an instruction-tuning dataset, MNLU-Instruct, from diverse existing open-source medical NLU corpora; and (3) develop BioMistral-NLU, a generalizable medical NLU model, by fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting on 6 important NLU tasks from two widely adopted medical NLU benchmarks, BLUE and BLURB. Our experiments show that BioMistral-NLU outperforms the original BioMistral as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning over diverse NLU tasks together enhance LLMs' generalizability across medical NLU tasks. Our ablation experiments show that instruction tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.
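The ablation's fixed-budget comparison can be sketched as follows: hold the total number of training instances constant while varying how many task pools contribute. The pool names, sizes, and the even split are made-up illustrations, not the paper's actual sampling procedure.

```python
import random

def sample_mixture(task_pools, num_tasks, total_budget, seed=0):
    """Evenly split a fixed training budget across the first num_tasks pools.

    Hypothetical helper: illustrates comparing task diversity at a constant
    instance count, not the paper's actual data-mixing code.
    """
    rng = random.Random(seed)
    chosen = list(task_pools)[:num_tasks]
    per_task = total_budget // num_tasks
    mixture = []
    for name in chosen:
        # Sample without replacement within each task's pool.
        mixture.extend(rng.sample(task_pools[name], per_task))
    return mixture

# Seven synthetic task pools of 1000 placeholder instances each.
pools = {f"task_{i}": [f"task_{i}_ex_{j}" for j in range(1000)] for i in range(7)}

# Same 700-instance budget, different task diversity.
narrow = sample_mixture(pools, num_tasks=2, total_budget=700)
wide = sample_mixture(pools, num_tasks=7, total_budget=700)
print(len(narrow), len(wide))  # both draw exactly 700 instances
```

Training one model per mixture and comparing zero-shot scores on held-out tasks isolates the effect of task diversity from raw data volume.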