PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) and vision-language models (VLMs) exhibit significant age bias in medical AI, a problem that is particularly severe in pediatric settings and undermines fairness and clinical utility. To address this, we introduce PediatricsMQA, a comprehensive multi-modal pediatric question-answering benchmark comprising 3,417 text-based multiple-choice questions and 2,067 image-based QA items spanning 131 pediatric topics, seven developmental stages (prenatal to adolescent), and diverse medical imaging modalities. Data are curated via a hybrid manual-automatic pipeline that integrates peer-reviewed pediatric literature, validated question banks, and existing QA resources, ensuring quality and clinical validity. Systematic evaluation of leading open-source LLMs and VLMs reveals a sharp performance drop on queries concerning younger cohorts, empirically confirming structural age bias in medical AI. PediatricsMQA thus provides a benchmark for detecting, attributing, and mitigating age bias, along with foundational evidence for advancing equitable pediatric AI.

📝 Abstract
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
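The evaluation described above, scoring models on multiple-choice items and breaking accuracy down by developmental stage, can be sketched as follows. The item schema, stage labels, and toy predictor are illustrative assumptions, not the paper's actual data format or evaluation harness.

```python
from collections import defaultdict

# Hypothetical MCQ item format; the benchmark's actual schema is not specified here.
items = [
    {"question": "Most common cause of bronchiolitis?",
     "options": ["RSV", "Influenza", "Rhinovirus", "Adenovirus"],
     "answer": "A", "stage": "infant"},
    {"question": "Typical age of first social smile?",
     "options": ["2 weeks", "6 weeks", "6 months", "12 months"],
     "answer": "B", "stage": "neonate"},
]

def accuracy_by_stage(items, predict):
    """Score a model's letter predictions, grouping accuracy by developmental stage."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["stage"]] += 1
        if predict(item) == item["answer"]:
            correct[item["stage"]] += 1
    return {stage: correct[stage] / total[stage] for stage in total}

# Toy predictor that always answers "A"; a real run would query an LLM or VLM.
print(accuracy_by_stage(items, lambda item: "A"))
# → {'infant': 1.0, 'neonate': 0.0}
```

Reporting accuracy per developmental stage rather than as a single aggregate is what surfaces the age-dependent performance gap the paper documents.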
Problem

Research questions and friction points this paper is trying to address.

Addressing age bias in medical AI models for pediatric care
Evaluating performance gaps on pediatric question-answering tasks
Developing a multi-modal benchmark for equitable pediatric medical AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

PediatricsMQA, a multi-modal pediatric question-answering benchmark
Hybrid manual-automatic dataset-construction pipeline
Empirical evidence motivating age-aware methods for pediatric care