HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing biomedical reasoning benchmarks suffer from limited linguistic diversity and shallow conceptual coverage, hindering progress in medical reasoning research. To address this, the paper introduces HEAD-QA v2, an expanded bilingual (Spanish/English) multiple-choice medical reasoning benchmark comprising over 12,000 questions drawn from ten years of Spanish professional healthcare examinations. The authors systematically evaluate reasoning strategies, including prompt engineering, retrieval-augmented generation (RAG), and probability-based answer selection, across multiple open-source large language models. Their analysis shows that advanced reasoning techniques yield only marginal gains; model scale and intrinsic reasoning ability remain the dominant factors governing performance. By providing a reproducible, extensible, and linguistically diverse benchmarking framework, HEAD-QA v2 offers a reliable foundation for future research in cross-lingual biomedical reasoning.

📝 Abstract
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
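The probability-based answer selection mentioned in the abstract is typically implemented by scoring each multiple-choice option under the language model and picking the most probable one. The sketch below illustrates the idea with a toy scoring function; `select_answer`, `toy_logprob`, and the length-normalized scoring are illustrative assumptions, not the paper's exact method (in practice the log-probabilities would come from an LLM's token logits).

```python
import math

def select_answer(question, options, logprob_fn):
    """Probability-based answer selection: score each candidate
    answer by its length-normalized log-probability and return
    the index of the highest-scoring option."""
    best_idx, best_score = None, -math.inf
    for i, option in enumerate(options):
        tokens = option.split()  # stand-in for real tokenization
        # average per-token log-probability, so longer options
        # are not penalized merely for their length
        score = sum(logprob_fn(question, t) for t in tokens) / len(tokens)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy scorer standing in for an LLM: tokens that also occur in the
# question get log-prob 0.0, everything else -1.0.
def toy_logprob(question, token):
    return 0.0 if token.lower() in question.lower() else -1.0

q = "Which vitamin deficiency causes scurvy?"
opts = ["vitamin C deficiency", "iron overload", "excess sodium"]
print(select_answer(q, opts, toy_logprob))  # → 0
```

With a real model, `logprob_fn` would be replaced by a conditional token log-probability computed from the model's logits given the question as context.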
Problem

Research questions and friction points this paper is trying to address.

Expanding the healthcare reasoning dataset to over 12,000 bilingual questions
Benchmarking open-source LLMs on complex medical inference and reasoning
Addressing limitations in biomedical reasoning research through improved, linguistically diverse evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expanded the healthcare dataset to over 12,000 questions
Benchmarked open-source LLMs using prompting, RAG, and probability-based answer selection
Provided additional multilingual versions to support biomedical reasoning research
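The RAG benchmarking listed above follows the usual retrieve-then-prompt pattern: fetch passages relevant to the exam question and prepend them as context before asking the model. The sketch below is a minimal illustration under stated assumptions; the word-overlap retriever, `retrieve`, and `build_prompt` are hypothetical stand-ins, not the paper's pipeline (real setups use dense or BM25 retrieval over a medical corpus).

```python
def _tokens(text):
    # lowercase and strip simple punctuation for overlap matching
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(query, corpus, k=2):
    """Naive retrieval: rank passages by word overlap with the query."""
    qw = _tokens(query)
    ranked = sorted(corpus, key=lambda p: len(qw & _tokens(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Assemble a RAG prompt: retrieved context followed by the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "Scurvy is caused by a lack of vitamin C.",
    "Iron is essential for hemoglobin synthesis.",
    "Sodium regulates extracellular fluid volume.",
]
prompt = build_prompt("What causes scurvy?",
                      retrieve("What causes scurvy?", corpus, k=1))
print(prompt)
```

The resulting prompt would then be passed to the LLM, optionally combined with probability-based answer selection over the multiple-choice options.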
Alexis Correa-Guillén
Universidade da Coruña, CITIC, Departamento de Ciencias de la Computación y Tecnologías de la Información, Campus de Elviña s/n 15071, A Coruña, Spain
Carlos Gómez-Rodríguez
Universidade da Coruña, CITIC, Departamento de Ciencias de la Computación y Tecnologías de la Información, Campus de Elviña s/n 15071, A Coruña, Spain
David Vilares
Universidade da Coruña, CITIC
Natural Language Processing