🤖 AI Summary
Japan lacks a localized benchmark for evaluating large language models (LLMs) in medical applications. Method: This study systematically assesses the applicability of HealthBench to Japanese clinical settings through machine translation, LLM-as-a-judge automated classification, and empirical evaluation across its 5,000 scenarios. Contribution/Results: We identify critical issues arising from direct translation, including clinical guideline mismatches, institutional discrepancies, and cultural norm conflicts. To address these, we argue for J-HealthBench, a context-aware, Japan-specific medical evaluation benchmark emphasizing clinical integrity and cultural adaptation. Experiments reveal that GPT-4.1 shows a modest performance drop attributable to non-localized rubric criteria, that the Japanese-native model fails severely owing to gaps in clinical completeness, and that a substantial share of HealthBench's rubric criteria (over 60%) require localization to align with Japanese clinical practice. This work establishes a methodological paradigm and a reusable localization framework for cross-lingual medical AI evaluation.
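To make the LLM-as-a-judge classification step concrete, here is a minimal Python sketch of how a rubric criterion might be sorted into contextual-gap categories. The category labels, prompt wording, and judge model name are illustrative assumptions, not the paper's exact protocol; the sketch assumes the `openai` Python client with an API key in the environment.

```python
# Minimal sketch of LLM-as-a-judge gap classification for translated rubric
# criteria. Labels, prompt, and model choice are illustrative assumptions,
# not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = [
    "applicable",          # usable in Japanese clinical settings as-is
    "guideline_mismatch",  # conflicts with Japanese clinical guidelines
    "system_mismatch",     # assumes non-Japanese healthcare institutions
    "cultural_mismatch",   # conflicts with Japanese cultural norms
]

JUDGE_PROMPT = """You are reviewing a medical benchmark criterion that was \
machine-translated into Japanese.
Classify its applicability to Japanese clinical practice.
Criterion: {criterion}
Answer with exactly one label from: {labels}"""

def classify_criterion(criterion: str, model: str = "gpt-4.1") -> str:
    """Ask the judge model to assign one contextual-gap category."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criterion=criterion, labels=", ".join(CATEGORIES)
            ),
        }],
        temperature=0,  # keep judgments as reproducible as possible
    )
    label = response.choices[0].message.content.strip()
    return label if label in CATEGORIES else "unparsed"
```

In practice such a classifier would be run over every scenario and rubric criterion, with the "unparsed" bucket flagged for manual review.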
📝 Abstract
This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. While robust evaluation frameworks are crucial for the safe development of medical LLMs, resources in Japanese remain limited, often relying on translated multiple-choice questions. Our research addresses this gap by first establishing a performance baseline, applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate both a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we employ an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria, identifying "contextual gaps" where content is misaligned with Japan's clinical guidelines, healthcare systems, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches and a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification indicates that while the majority of scenarios are applicable, a substantial portion of the rubric criteria requires localization. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localized adaptation (J-HealthBench) to ensure the reliable and safe evaluation of medical LLMs in Japan.
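For readers unfamiliar with rubric-based benchmarks of this kind, the sketch below shows the general scoring shape: a grader judges whether each weighted criterion is met, and the example-level score is the awarded points over the maximum attainable points. The data shapes, field names, and clipping behavior are assumptions modeled on HealthBench's public description, not a verified reimplementation.

```python
# Illustrative rubric scoring in the HealthBench style: each criterion carries
# a point weight (possibly negative for penalized behaviors); a response's
# score is the sum of points for criteria judged "met", normalized by the
# maximum attainable points. Shapes here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # what the response should (or should not) contain
    points: int  # positive = desired behavior, negative = penalized behavior
    met: bool    # grader's binary judgment for a given model response

def rubric_score(criteria: list[Criterion]) -> float:
    """Awarded points over maximum possible points, clipped to [0, 1]."""
    awarded = sum(c.points for c in criteria if c.met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    return min(max(awarded / max_points, 0.0), 1.0)

# Example: one positive criterion met, one missed, one penalty triggered.
example = [
    Criterion("Advises consulting a physician", 5, True),
    Criterion("References applicable Japanese clinical guidelines", 3, False),
    Criterion("Gives a confident but unsupported dosage", -4, True),
]
print(rubric_score(example))  # (5 - 4) / (5 + 3) = 0.125
```

This shape makes the paper's localization concern concrete: because the score is computed directly from the rubric criteria, any criterion misaligned with Japanese guidelines or institutions distorts the measured performance even when the model's answer is clinically appropriate for Japan.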