Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs

📅 2025-06-13
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study addresses the high domain complexity and scarce annotated data that make biomedical multiple-choice question answering (MedQA) difficult, focusing on PubMedQA. Methodologically, it systematically investigates the joint optimization of open-source large language models (LLMs) via lightweight fine-tuning and prompt engineering, combining zero-shot chain-of-thought (CoT) prompting, standard instruction tuning, and parameter-efficient QLoRA fine-tuning across multiple leading open LLM families. Key findings: CoT prompting substantially improves zero-shot performance, and instruction tuning delivers consistent gains; however, combining CoT-aware fine-tuning with QLoRA is model- and scale-dependent, and degrades performance in certain larger models. This constitutes the first empirical evidence challenging the widely held assumption that CoT fine-tuning universally enhances performance, thereby delineating its applicability boundary. The work provides critical methodological insights and practical guidelines for adapting LLMs to biomedical QA.

📝 Abstract
Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies - standard instruction prompts and Chain-of-Thought (CoT) prompts - and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient fine-tuning for medical QA applications.
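The two prompting strategies compared in the abstract can be illustrated with a small sketch. Note this is an assumption about the general shape of such prompts, not the authors' exact templates; the wording, the `build_prompt` helper, and the yes/no/maybe option set (PubMedQA's label space) are illustrative.

```python
# Illustrative sketch (not the paper's exact templates): a standard
# instruction prompt vs. a zero-shot CoT prompt for a PubMedQA-style item.

def build_prompt(question: str, context: str, use_cot: bool = False) -> str:
    """Assemble a PubMedQA-style prompt; the answer options are yes/no/maybe."""
    prompt = (
        "Context: " + context + "\n"
        "Question: " + question + "\n"
        "Answer with one of: yes, no, maybe.\n"
    )
    if use_cot:
        # Zero-shot CoT: insert a reasoning trigger before the answer slot.
        prompt += "Let's think step by step.\n"
    prompt += "Answer:"
    return prompt

standard = build_prompt("Does drug X reduce mortality?", "Trial abstract ...")
cot = build_prompt("Does drug X reduce mortality?", "Trial abstract ...",
                   use_cot=True)
```

The only difference between the two conditions is the appended reasoning trigger, which is what makes the zero-shot CoT comparison a controlled one.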
Problem

Research questions and friction points this paper is trying to address.

Adapting LLMs to biomedical reasoning challenges
Improving medical QA with prompt design and fine-tuning
Evaluating CoT prompting and instruction tuning on PubMedQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction tuning boosts medical QA accuracy
CoT prompting enhances zero-shot reasoning
QLoRA enables parameter-efficient fine-tuning
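To make the parameter-efficiency point concrete: QLoRA trains small low-rank adapter matrices on top of a frozen, 4-bit-quantized base model. A minimal pure-Python sketch of the underlying LoRA update follows; the quantization step is omitted, and the dimensions and values are toy numbers, not from the paper.

```python
# Conceptual sketch of LoRA's low-rank update (QLoRA applies this on top of
# a 4-bit quantized frozen base; quantization is omitted here).

def matmul(A, B):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 8, 2                                    # hidden size d, adapter rank r << d
W = [[0.0] * d for _ in range(d)]              # frozen base weight, shape (d, d)
A = [[0.1] * d for _ in range(r)]              # trainable down-projection (r, d)
B = [[0.1] * r for _ in range(d)]              # trainable up-projection   (d, r)
alpha = 4.0                                    # LoRA scaling hyperparameter

delta = matmul(B, A)                           # rank-r update, shape (d, d)
W_adapted = [[w + (alpha / r) * dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]

full_params = d * d                            # params if tuning W directly
lora_params = r * d + d * r                    # params actually trained by LoRA
```

With these toy sizes the adapter trains 32 parameters instead of 64; at realistic hidden sizes (thousands) and small ranks, the savings are orders of magnitude, which is what makes this kind of fine-tuning feasible for the open models studied here.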
Chenqian Le
New York University, New York, USA
Ziheng Gong
New York University, New York, USA
Chihang Wang
New York University, New York, USA
Haowei Ni
Columbia University, New York, USA
Panfeng Li
University of Michigan, Ann Arbor, USA
Xupeng Chen
Research Scientist, TikTok | Ph.D. in Electrical Engineering, New York University
LLM · Multi-Modal · BCI · Computer Vision · Natural Language Processing