Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs

📅 2025-06-13

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This study addresses two obstacles to biomedical question answering with large language models (LLMs): high domain complexity and scarce annotated data. It evaluates open-source LLMs on PubMedQA, comparing standard instruction prompts, zero-shot chain-of-thought (CoT) prompts, and parameter-efficient instruction tuning with QLoRA across multiple model families and sizes. Three findings stand out. CoT prompting alone can improve zero-shot reasoning. Instruction tuning significantly boosts accuracy. Fine-tuning on CoT prompts, however, does not help universally and may degrade performance for certain larger models, so the benefit of reasoning-aware prompts depends on the model and its scale. The work offers practical guidance for combining prompt engineering with efficient fine-tuning in medical QA.

📝 Abstract

Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies, standard instruction prompts and Chain-of-Thought (CoT) prompts, and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient fine-tuning for medical QA applications.
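
To make the two prompting strategies concrete, here is a minimal Python sketch of how a standard instruction prompt and a zero-shot CoT prompt might be built for a PubMedQA-style item, plus a parser for the predicted label. The prompt wording and the names build_prompt and parse_label are assumptions made for illustration, not the paper's templates. The only dataset detail relied on is that PubMedQA answers are labeled yes, no, or maybe.

```python
from typing import Optional

LABELS = ("yes", "no", "maybe")  # PubMedQA's three answer labels


def build_prompt(question: str, context: str, use_cot: bool = False) -> str:
    """Return a standard instruction prompt, or a zero-shot CoT variant.

    The wording is hypothetical; the paper's exact templates are not shown here.
    """
    header = (
        "Answer the biomedical question using the context. "
        "Respond with one of: yes, no, maybe.\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
    )
    if use_cot:
        # Zero-shot CoT: request reasoning first, then a clearly marked answer.
        return header + (
            "Let's think step by step, then finish with "
            "'Final answer: <label>'.\n"
        )
    return header + "Answer:"


def parse_label(completion: str) -> Optional[str]:
    """Extract the first valid label from a completion, or None if absent."""
    text = completion.lower()
    marker = "final answer:"
    if marker in text:
        # Only look at what follows the last marker, skipping the reasoning.
        text = text.rsplit(marker, 1)[1]
    for word in text.replace(".", " ").replace(",", " ").split():
        if word in LABELS:
            return word
    return None
```

Parsing after the final-answer marker matters for CoT outputs, because the reasoning text itself may contain words such as "no" before the model commits to a label.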

Problem

Research questions and friction points this paper is trying to address.

Adapting LLMs to biomedical reasoning despite domain complexity and limited supervision

Improving medical QA with prompt design and fine-tuning

Evaluating CoT prompting and instruction tuning on PubMedQA

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction tuning boosts medical QA accuracy

CoT prompting enhances zero-shot reasoning

QLoRA enables parameter-efficient fine-tuning
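
As a rough illustration of why QLoRA counts as parameter-efficient: the base weights are quantized to 4 bits and kept frozen, and only small low-rank adapter matrices are trained. A LoRA adapter of rank r on a weight matrix of shape d_out x d_in adds r * (d_out + d_in) trainable parameters. The sketch below computes that count and the resulting fraction. The layer size and rank are illustrative assumptions, not values reported in the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters relative to the frozen weight matrix they modify."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical example: a 4096 x 4096 projection with a rank-16 adapter.
added = lora_trainable_params(4096, 4096, 16)
frac = trainable_fraction(4096, 4096, 16)
print(f"adapter parameters: {added}, fraction of the layer: {frac}")
```

Under these assumed sizes the adapter trains well under one percent of the parameters in the layer it modifies, which is what makes fine-tuning feasible on modest hardware.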
