KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations

📅 2025-06-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing medical AI benchmarks are predominantly English-only, text-based, and centered on physician licensing, so they fail to assess multilingual, multimodal, and cross-profession clinical reasoning. Method: We introduce KokushiMD-10, the first multimodal benchmark covering all ten nationally administered Japanese healthcare licensing examinations, spanning medicine, dentistry, nursing, pharmacy, and allied health professions, and comprising over 11,588 authentic exam questions with associated clinical images and expert-annotated rationales. Contribution/Results: We systematically evaluate over 30 state-of-the-art large language models (e.g., GPT-4o, Claude 3.5, Gemini) in both text-only and image-based settings. No model consistently reaches passing thresholds across all domains, exposing fundamental limitations in high-stakes clinical reasoning. KokushiMD-10 is publicly released, establishing a comprehensive resource for multilingual, multimodal medical AI evaluation.

πŸ“ Abstract
Recent advances in large language models (LLMs) have demonstrated notable performance on medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicine, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including medicine, dentistry, nursing, pharmacy, and allied health professions. It contains over 11,588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text- and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.
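The abstract's central measurement, whether a model's per-domain accuracy clears each exam's passing threshold, can be sketched as a small scoring loop. This is a hypothetical illustration: the record fields, domain names, and threshold values below are assumptions for the sketch, not the paper's actual data format or official exam cutoffs.

```python
# Hypothetical sketch: scoring model predictions against per-domain passing
# thresholds, as a KokushiMD-10-style evaluation might do. Thresholds and
# field names are illustrative assumptions, not the benchmark's real values.
from collections import defaultdict

# Assumed passing thresholds (fraction correct) per licensing domain.
PASS_THRESHOLDS = {"medicine": 0.75, "dentistry": 0.65, "nursing": 0.60}

def score_by_domain(records):
    """records: iterable of dicts with 'domain', 'gold', 'predicted' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        correct[r["domain"]] += int(r["predicted"] == r["gold"])
    return {
        d: {
            "accuracy": correct[d] / total[d],
            "passed": correct[d] / total[d] >= PASS_THRESHOLDS.get(d, 0.6),
        }
        for d in total
    }

example = [
    {"domain": "medicine", "gold": "b", "predicted": "b"},
    {"domain": "medicine", "gold": "c", "predicted": "a"},
    {"domain": "nursing", "gold": "d", "predicted": "d"},
]
print(score_by_domain(example))
```

Reporting a per-domain pass/fail flag rather than a single pooled accuracy mirrors the paper's framing: a model can look strong in aggregate while still failing individual licensing exams.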
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in diverse healthcare roles and high-stakes scenarios
Overcoming the limitations of English-centric, text-only medical benchmarks
Assessing multimodal reasoning in multilingual clinical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for healthcare exams
Includes clinical images and expert rationales
Evaluates multilingual and multimodal reasoning
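The innovations above, clinical images attached to exam items plus expert rationales, suggest a simple record shape for each question. The sketch below is an assumed schema for illustration only; the field names are not the released dataset's actual format.

```python
# Hypothetical sketch of a multimodal exam item, showing the kind of fields
# a KokushiMD-10-style record might carry (optional clinical image, expert
# rationale). The schema is an assumption, not the benchmark's real layout.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExamQuestion:
    exam: str                          # e.g. "medicine", "dentistry", "nursing"
    question: str                      # question text (Japanese in the benchmark)
    choices: List[str]                 # multiple-choice options
    answer: str                        # gold answer label, e.g. "b"
    image_path: Optional[str] = None   # clinical image, if the item is multimodal
    rationale: str = ""                # expert-annotated explanation

q = ExamQuestion(
    exam="medicine",
    question="Which finding is most consistent with the image?",
    choices=["a", "b", "c", "d"],
    answer="b",
    image_path="images/case_001.png",
)
print(q.exam, q.answer, q.image_path is not None)
```

Keeping the image optional lets one dataset serve both the text-only and image-based evaluation settings the paper describes.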
Authors

Junyu Liu (Kyoto University)
Kaiqi Yan (The Hong Kong University of Science and Technology)
Tianyang Wang (University of Alabama at Birmingham)
Qian Niu (UT Austin)
M. Nagai-Tanima (Kyoto University)
T. Aoyama (Kyoto University)