🤖 AI Summary
This study addresses the high domain complexity and scarce annotated data that make biomedical question answering (MedQA) difficult, using the multiple-choice PubMedQA benchmark. It systematically investigates how lightweight fine-tuning and prompt engineering combine when adapting open-source large language models (LLMs), covering zero-shot chain-of-thought (CoT) prompting, standard instruction tuning, and parameter-efficient QLoRA fine-tuning across multiple open LLM families and sizes. Key findings: CoT prompting alone can improve zero-shot performance, and instruction tuning significantly boosts accuracy. However, QLoRA fine-tuning on CoT prompts is model- and scale-dependent, and it degrades performance for certain larger models. This result challenges the assumption that CoT fine-tuning universally improves performance and helps delineate where it applies. The work offers practical guidance for combining prompt engineering with efficient fine-tuning in biomedical LLM adaptation.
📝 Abstract
Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies, standard instruction prompts and Chain-of-Thought (CoT) prompts, and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient fine-tuning for medical QA applications.
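To make the two prompting strategies concrete, the following is a minimal sketch of how a standard instruction prompt and a CoT prompt for a PubMedQA-style item might differ. The prompt wording, the "step by step" trigger phrase, the yes/no/maybe label set (taken from the public PubMedQA dataset), and the helper names `build_prompt` and `parse_answer` are all illustrative assumptions; the abstract does not specify the templates the authors used.

```python
import re

# Assumed label set, following the public PubMedQA dataset (not stated in the abstract).
LABELS = ("yes", "no", "maybe")


def build_prompt(question: str, context: str, use_cot: bool = False) -> str:
    """Build a standard instruction prompt or a CoT prompt (hypothetical wording)."""
    instruction = (
        "Answer the biomedical question using the context. "
        "Reply with one of: " + ", ".join(LABELS) + "."
    )
    parts = [instruction, f"Context: {context}", f"Question: {question}"]
    if use_cot:
        # Zero-shot CoT: ask for reasoning first, then a clearly marked final label.
        parts.append(
            "Let's think step by step, then give the final answer as 'Answer: <label>'."
        )
    else:
        # Standard instruction prompt: ask for the label directly.
        parts.append("Answer:")
    return "\n".join(parts)


def parse_answer(completion: str):
    """Return the last label mentioned in the model output, or None if there is none."""
    matches = re.findall(r"\b(yes|no|maybe)\b", completion.lower())
    return matches[-1] if matches else None
```

Taking the last label mentioned is a deliberate simplification: a CoT completion may mention several labels while reasoning, and the final one is the most likely to be the committed answer. A stricter parser could require the `Answer:` marker instead.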