🤖 AI Summary
Large language models (LLMs) and vision-language models (VLMs) exhibit systematic age bias in medical AI, most severe in pediatric settings, undermining fairness and clinical utility. To address this, we introduce PediatricsMQA, a comprehensive multimodal pediatric question-answering benchmark comprising 3,417 text-based multiple-choice questions and 2,067 image-based QA items spanning 131 pediatric topics, seven developmental stages (prenatal to adolescent), and diverse medical imaging modalities. Data are curated via a hybrid manual-automatic pipeline integrating peer-reviewed clinical literature, validated question banks, and public QA resources, ensuring quality and clinical validity. Systematic evaluation of leading open-source LLMs and VLMs reveals a sharp performance drop on queries about younger cohorts, providing empirical evidence of structural age bias in medical AI. PediatricsMQA thus establishes a benchmark for detecting, attributing, and mitigating age bias, laying a foundation for more equitable pediatric AI.
📝 Abstract
Large language models (LLMs) and vision-language models (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. The bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multimodal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images spanning 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and other QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops on younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.