MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning

📅 2025-09-14

🤖 AI Summary

This study addresses the suboptimal performance of large language models on Arabic clinical question answering, covering both multiple-choice and open-ended tasks. The authors combine prompt engineering with ensemble learning: an ensemble of three prompt configurations for the multiple-choice subtask, and a unified role-playing prompt (e.g., “senior Arabic-speaking clinician”) for the open-ended subtask. Together these target model comprehension and generation across clinical scenarios that include biased questions, fill-in-the-blank (cloze) reasoning, and doctor–patient dialogue. Built on Gemini 2.5 Flash, the approach uses few-shot prompting, dataset preprocessing, domain-adapted exemplars, and answer post-processing. On the AraHealthQA-2025 benchmark, the systems placed second on both subtasks. The work offers a reusable, prompt-driven approach for medical AI in low-resource languages.

📝 Abstract

We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we use the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we use a single unified prompt with the same model, combining role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across fill-in-the-blank, patient-doctor Q&A, grammatical error correction (GEC), and paraphrased variants.
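The abstract does not say how the three prompt configurations are combined, so the following is only a minimal sketch that assumes a simple majority vote over the predicted option letters. The function name `majority_vote`, the tie-breaking rule (the earliest-listed configuration wins), and the sample predictions are all hypothetical; no model is called here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent answer among the configurations.

    Assumption: ties are broken in favour of the earliest-listed
    configuration. The source does not describe its combination rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in order so the first top-scoring answer wins ties.
    for answer in predictions:
        if counts[answer] == best:
            return answer


# Hypothetical option letters returned by three prompt configurations
# for two multiple-choice questions.
per_question = [["A", "B", "A"], ["C", "D", "B"]]
final_answers = [majority_vote(p) for p in per_question]
print(final_answers)
```

With an odd number of configurations a clear majority is common, but a three-way disagreement is still possible, which is why the sketch needs an explicit tie-breaking rule.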

Problem

Research questions and friction points this paper is trying to address.

Improving Arabic clinical question answering accuracy

Enhancing LLM performance through prompt engineering

Applying ensemble learning for medical QA tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble learning with multiple prompt configurations

Few-shot prompting for Arabic clinical contexts

Role-playing as medical expert for response generation
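As an illustration of how a role-playing few-shot prompt and a post-processing step might fit together, here is a small sketch. The role sentence, the `Question:`/`Answer:` layout, and the helper names `build_prompt` and `postprocess` are assumptions made for this example; the actual prompt wording and post-processing rules are not given in the text above.

```python
# Hypothetical role instruction; the real prompt wording is not given in the source.
ROLE = "You are a senior Arabic-speaking clinician. Answer concisely in Arabic."


def build_prompt(question, examples):
    """Assemble the role instruction, few-shot (question, answer) pairs, and the new question."""
    parts = [ROLE]
    for q, a in examples:
        parts.append(f"Question: {q}\nAnswer: {a}")
    # Leave the final answer slot open for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def postprocess(raw):
    """Keep only the first non-empty line and drop a leading 'Answer:' label."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return ""
    first = lines[0]
    if first.lower().startswith("answer:"):
        first = first[len("answer:"):].strip()
    return first


prompt = build_prompt("Q2", [("Q1", "A1")])
print(prompt)
print(postprocess("  Answer: paracetamol \n extra explanation"))
```

The returned string would be sent to whichever model client is in use. Trimming the reply to its first line is one simple way to enforce concise answers, though a real system may need task-specific rules.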
