🤖 AI Summary
Large language models (LLMs) have not been systematically evaluated for legal knowledge and reasoning in the Japanese legal domain. Method: We introduce JBE-QA, the first publicly available, multi-domain legal question-answering dataset built from the Japanese Bar Examination (2015–2024), covering civil law, criminal law, and constitutional law. We reformulate multiple-choice questions into structured true/false judgment tasks and propose a fine-grained annotation framework to support rigorous evaluation of legal reasoning. Contribution/Results: JBE-QA establishes a unified, cross-domain benchmark for Japanese legal AI, moving beyond prior resources confined to civil law. A comprehensive evaluation of 26 LLMs shows that proprietary models with reasoning enabled achieve the best performance, and that constitutional law questions are generally easier than civil or criminal law items. This work provides the first open, multi-domain evaluation benchmark for legal AI research in Japan.
📄 Abstract
We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset for evaluating the legal knowledge of large language models. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015–2024), JBE-QA provides the first comprehensive benchmark for evaluating LLMs in the Japanese legal domain. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and that questions on the Constitution are generally easier than those on the Civil Code or the Penal Code.
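To make the decomposition concrete, here is a minimal sketch of what a decomposed true/false item and a simple accuracy check could look like. The field names (`year`, `domain`, `context`, `statement`, `label`) and the example statements are illustrative assumptions, not the dataset's actual schema or content.

```python
# Hypothetical sketch of JBE-QA's true/false decomposition.
# Field names and example statements are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class JudgmentItem:
    year: int        # exam year (2015-2024)
    domain: str      # "civil", "criminal", or "constitutional"
    context: str     # structured contextual field shared by sibling statements
    statement: str   # one statement extracted from a multiple-choice question
    label: bool      # gold true/false judgment

# One multiple-choice question yields several independent judgment items
# that share the same context but are evaluated separately.
items = [
    JudgmentItem(2020, "civil",
                 "A contracts to sell land to B.",
                 "B acquires ownership only upon registration.", False),
    JudgmentItem(2020, "civil",
                 "A contracts to sell land to B.",
                 "Ownership can transfer when the contract is formed.", True),
]

def accuracy(predictions: list[bool], gold: list[bool]) -> float:
    """Fraction of true/false judgments the model got right."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# A model that answers True to everything gets one of two items right here.
gold_labels = [item.label for item in items]
print(accuracy([True, True], gold_labels))  # 0.5
```

Because labels are balanced across the dataset, a constant or random responder scores near 0.5 under this metric, so accuracy above that baseline reflects actual legal judgment.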