Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the zero-shot performance of large language models (LLMs) on medical question-answering tasks, with a focus on deployment feasibility in resource-constrained settings. Leveraging the iCliniq dataset, the authors establish a standardized benchmark for medical QA and conduct a systematic comparison of five prominent LLMs without task-specific fine-tuning. Response quality is assessed using multiple metrics, including BLEU, ROUGE, and LLM-as-a-Judge evaluations. Results indicate that Llama-3.3-70B-Instruct achieves the strongest overall performance, while Llama-4-Maverick-17B offers the best trade-off between inference efficiency and accuracy. These findings elucidate the interplay among model scale, computational efficiency, and clinical utility, providing empirical guidance for designing lightweight yet effective medical NLP systems.

📝 Abstract
Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in the development of medical QA systems for enhancing access to healthcare in low-resource settings. This paper compares five LLMs released between April 2024 and August 2025 for medical QA, using the iCliniq dataset, which contains 38,000 medical questions and answers spanning diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We adopt a zero-shot evaluation methodology, using BLEU and ROUGE metrics to measure performance without specialized fine-tuning. Our results show that larger models such as Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited competitive results, highlighting accuracy-efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that minimize model size and computational resources while maximizing clinical utility in medical NLP applications.
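The abstract's zero-shot evaluation relies on n-gram overlap metrics such as BLEU and ROUGE. As a self-contained sketch of what ROUGE-1 scoring of a model answer against a reference looks like (the paper would in practice use a standard library implementation; `rouge1_f1` and the example QA pair below are illustrative, not from the paper):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped unigram overlap: each reference token is matched at most once.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical QA pair purely for illustration.
reference = "Take paracetamol and rest for two days"
model_answer = "You should rest and take paracetamol"
score = rouge1_f1(model_answer, reference)  # ~0.615
```

A zero-shot benchmark run then amounts to generating an answer per question with no fine-tuning or in-context examples, scoring each against the iCliniq reference answer, and averaging per model.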
Problem

Research questions and friction points this paper is trying to address.

Medical QA
Large Language Models
Zero-Shot Evaluation
LLM-as-a-Judge
Clinical NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot evaluation
Medical QA benchmark
LLM-as-a-Judge
Model scaling trade-offs
Clinical NLP
Shefayat E Shams Adib
Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Ahmed Alfey Sani
Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Ekramul Alam Esham
Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
Ajwad Abrar
Junior Lecturer, IUT
Natural Language Processing, Human Computer Interaction, Software Engineering
Tareque Mohmud Chowdhury
Assistant Professor, Islamic University of Technology
Bioinformatics, Natural Language Processing