🤖 AI Summary
Japan lacks a localized benchmark for evaluating large language models (LLMs) in medical applications. Method: This study systematically assesses the applicability of HealthBench to Japanese clinical settings through machine translation, LLM-as-a-judge automated classification, and empirical evaluation across its 5,000 scenarios. Contribution/Results: We identify critical issues arising from direct translation, including clinical guideline mismatches, institutional discrepancies, and cultural norm conflicts. To address these, we argue for J-HealthBench, a context-aware, Japan-specific medical evaluation benchmark emphasizing clinical integrity and cultural adaptation. Experiments reveal that GPT-4.1 shows a modest performance drop attributable to non-localized rubric criteria, that the Japanese-native model fails severely owing to gaps in clinical completeness, and that a substantial share of HealthBench's rubric criteria (over 60%) require localization to align with Japanese clinical practice. This work establishes a methodological paradigm and a reusable localization framework for cross-lingual medical AI evaluation.
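To make the LLM-as-a-judge classification step concrete, here is a minimal Python sketch of how a rubric criterion might be sorted into contextual-gap categories. The category labels, prompt wording, and judge model name are illustrative assumptions, not the paper's exact protocol; the sketch assumes the `openai` Python client with an API key in the environment.

```python
# Minimal sketch of LLM-as-a-judge gap classification for translated rubric
# criteria. Labels, prompt, and model choice are illustrative assumptions,
# not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = [
    "applicable",          # usable in Japanese clinical settings as-is
    "guideline_mismatch",  # conflicts with Japanese clinical guidelines
    "system_mismatch",     # assumes non-Japanese healthcare institutions
    "cultural_mismatch",   # conflicts with Japanese cultural norms
]

JUDGE_PROMPT = """You are reviewing a medical benchmark criterion that was \
machine-translated into Japanese.
Classify its applicability to Japanese clinical practice.
Criterion: {criterion}
Answer with exactly one label from: {labels}"""

def classify_criterion(criterion: str, model: str = "gpt-4.1") -> str:
    """Ask the judge model to assign one contextual-gap category."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criterion=criterion, labels=", ".join(CATEGORIES)
            ),
        }],
        temperature=0,  # keep judgments as reproducible as possible
    )
    label = response.choices[0].message.content.strip()
    return label if label in CATEGORIES else "unparsed"
```

In practice such a classifier would be run over every scenario and rubric criterion, with the "unparsed" bucket flagged for manual review.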
📝 Abstract
This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. While robust evaluation frameworks are crucial for the safe development of medical LLMs, resources in Japanese remain limited, often relying on translated multiple-choice questions. Our research addresses this gap by first establishing a performance baseline, applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate both a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we employ an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria, identifying "contextual gaps" where content is misaligned with Japan's clinical guidelines, healthcare systems, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches and a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification indicates that while the majority of scenarios are applicable, a substantial portion of the rubric criteria requires localization. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localized adaptation (J-HealthBench) to ensure the reliable and safe evaluation of medical LLMs in Japan.
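For readers unfamiliar with rubric-based benchmarks of this kind, the sketch below shows the general scoring shape: a grader judges whether each weighted criterion is met, and the example-level score is the awarded points over the maximum attainable points. The data shapes, field names, and clipping behavior are assumptions modeled on HealthBench's public description, not a verified reimplementation.

```python
# Illustrative rubric scoring in the HealthBench style: each criterion carries
# a point weight (possibly negative for penalized behaviors); a response's
# score is the sum of points for criteria judged "met", normalized by the
# maximum attainable points. Shapes here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # what the response should (or should not) contain
    points: int  # positive = desired behavior, negative = penalized behavior
    met: bool    # grader's binary judgment for a given model response

def rubric_score(criteria: list[Criterion]) -> float:
    """Awarded points over maximum possible points, clipped to [0, 1]."""
    awarded = sum(c.points for c in criteria if c.met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    return min(max(awarded / max_points, 0.0), 1.0)

# Example: one positive criterion met, one missed, one penalty triggered.
example = [
    Criterion("Advises consulting a physician", 5, True),
    Criterion("References applicable Japanese clinical guidelines", 3, False),
    Criterion("Gives a confident but unsupported dosage", -4, True),
]
print(rubric_score(example))  # (5 - 4) / (5 + 3) = 0.125
```

This shape makes the paper's localization concern concrete: because the score is computed directly from the rubric criteria, any criterion misaligned with Japanese guidelines or institutions distorts the measured performance even when the model's answer is clinically appropriate for Japan.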