🤖 AI Summary
This study addresses the suboptimal performance of large language models on Arabic clinical question answering, covering both multiple-choice and open-ended tasks. We propose a framework that combines prompt engineering with ensembling. For multiple-choice questions, the method ensembles three prompt configurations; for open-ended questions, it uses a single unified role-playing prompt (e.g., “senior Arabic-speaking clinician”). The question types covered include standard and biased multiple-choice items, fill-in-the-blank questions, and doctor–patient dialogue. Built on Gemini 2.5 Flash, the approach uses few-shot prompting with domain-adapted exemplars, structured data preprocessing, and answer post-processing. On the AraHealthQA-2025 shared task, the framework placed second in both subtasks. This work offers a reusable, prompt-driven approach to medical AI in low-resource languages.
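As a rough illustration of how a role-playing instruction and few-shot exemplars can be assembled into one prompt, here is a minimal sketch. The role wording, the `Exemplar` type, the `build_prompt` helper, and the "Question:/Answer:" layout are all assumptions made for illustration; the summary does not give the actual prompt text or format.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One few-shot example: a question and its reference answer."""
    question: str
    answer: str


# Hypothetical role instruction; the real prompt wording is not given in the source.
ROLE = "You are a senior Arabic-speaking clinician. Answer concisely in Arabic."


def build_prompt(question: str, exemplars: list[Exemplar], role: str = ROLE) -> str:
    """Concatenate the role instruction, the few-shot exemplars, and the new question."""
    parts = [role, ""]
    for i, ex in enumerate(exemplars, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Question: {ex.question}")
        parts.append(f"Answer: {ex.answer}")
        parts.append("")
    # The prompt ends on an open "Answer:" slot for the model to complete.
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    shots = [Exemplar("ما هو العرض الأكثر شيوعاً لفقر الدم؟", "التعب والإرهاق")]
    print(build_prompt("ما هي أعراض ارتفاع ضغط الدم؟", shots))
```

The resulting string would then be sent to the model through whatever client is in use; that call is left out here because the source does not describe it.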
📝 Abstract
We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured second place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we use the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we use a single unified prompt with the same model, combining role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across four question variants: fill-in-the-blank, patient-doctor Q&A, grammatical error correction (GEC), and paraphrased questions.
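One common way to combine the outputs of several prompt configurations on a multiple-choice question is majority voting over the extracted answer letters. The sketch below illustrates that idea only; the abstract does not say how the three configurations are combined, so the voting rule, the tie-breaking policy, the assumption that options are labeled with Latin letters A to E, and the helper names `extract_choice` and `majority_vote` are all hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Assumption: options are labeled with standalone uppercase Latin letters A-E.
LETTER = re.compile(r"\b([A-E])\b")


def extract_choice(raw: str) -> Optional[str]:
    """Post-process a raw model reply into a single option letter, or None."""
    match = LETTER.search(raw)
    return match.group(1) if match else None


def majority_vote(raw_outputs: list[str], fallback_index: int = 0) -> Optional[str]:
    """Return the most frequent option letter across prompt configurations.

    Ties go to the configuration at `fallback_index` when its answer is among
    the tied options, otherwise to the tied option that was seen first.
    """
    choices = [extract_choice(raw) for raw in raw_outputs]
    valid = [c for c in choices if c is not None]
    if not valid:
        return None
    counts = Counter(valid)
    top_count = counts.most_common(1)[0][1]
    tied = [c for c, n in counts.items() if n == top_count]
    if len(tied) == 1:
        return tied[0]
    if choices[fallback_index] in tied:
        return choices[fallback_index]
    return tied[0]


if __name__ == "__main__":
    # Three replies, one per (hypothetical) prompt configuration.
    print(majority_vote(["B", "Answer: B", "C"]))
```

With three configurations, a two-to-one split resolves directly, and the fallback only matters when all three disagree or some replies cannot be parsed.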