BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the robustness deficits of large language models (LLMs) in syllogistic reasoning where conclusions are logically valid but contradict human commonsense beliefs. To address the lack of evaluation resources, we introduce BIS Reasoning 1.0, the first large-scale, manually verified Japanese benchmark of belief-inconsistent syllogisms. The benchmark is systematically constructed from formal logic, pairing valid inferences with belief-violating conclusions, which enables controlled assessment of how models resolve belief–logic conflicts. We evaluate state-of-the-art LLMs, including GPT-4o, Claude, and leading Japanese-language models, in a zero-shot, multi-model evaluation framework. Results show that while GPT-4o achieves the highest accuracy (79.54%), all models exhibit significant belief bias, frequently failing to prioritize logical validity over intuitive plausibility even when the task explicitly demands it. Our core contribution is a scalable, logic-grounded evaluation paradigm for belief–logic conflicts, enabling the first quantitative assessment of this capability in Japanese.
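To make the evaluation setup concrete, the following is a minimal sketch of a zero-shot judgment on one belief-inconsistent item. It is not the authors' code: the field names, prompt wording, and the `query_model` helper are illustrative assumptions about what such a pipeline could look like.

```python
# Minimal sketch of a zero-shot judgment on one belief-inconsistent
# syllogism. Field names, prompt wording, and query_model() are
# illustrative assumptions, not the paper's actual schema or code.

# A logically valid syllogism whose conclusion conflicts with
# commonsense belief (English gloss; the benchmark items are Japanese).
item = {
    "premise_1": "All mammals can fly.",             # belief-violating premise
    "premise_2": "All whales are mammals.",
    "conclusion": "Therefore, all whales can fly.",  # valid but implausible
    "gold_label": "valid",                           # judged on logic alone
}

def build_prompt(item: dict) -> str:
    """Zero-shot prompt asking for a purely logical judgment."""
    return (
        "Assume the premises are true, even if they contradict common sense.\n"
        f"Premise 1: {item['premise_1']}\n"
        f"Premise 2: {item['premise_2']}\n"
        f"Conclusion: {item['conclusion']}\n"
        "Does the conclusion follow logically from the premises? "
        "Answer 'valid' or 'invalid'."
    )

def query_model(prompt: str, model: str = "gpt-4o") -> str:
    """Stand-in for a real LLM API call; returns a fixed answer so the
    sketch runs end to end. Replace with an actual client call."""
    return "valid"

prediction = query_model(build_prompt(item)).strip().lower()
is_correct = prediction == item["gold_label"]
```

A model with strong belief bias would answer "invalid" here, rejecting a valid inference because the conclusion is implausible; that failure mode is exactly what the benchmark measures.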

📝 Abstract
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
Problem

Research questions and friction points this paper is trying to address.

Evaluating belief-inconsistent reasoning of LLMs in Japanese
Uncovering reasoning biases in human-aligned LLMs
Assessing LLM performance on logically valid but belief-conflicting inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale Japanese syllogistic reasoning dataset
Designed for belief-inconsistent reasoning evaluation
Benchmarks performance variance across state-of-the-art LLMs (see the sketch after this list)
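As a rough illustration of the multi-model benchmarking step, the sketch below aggregates per-model accuracy over the dataset, reusing `build_prompt`, `query_model`, and `item` from the earlier sketch. The model identifiers are placeholders, not the paper's exact model list.

```python
# Sketch of per-model accuracy aggregation over the benchmark, reusing
# build_prompt()/query_model() from the sketch above. Model identifiers
# below are placeholders, not the exact models evaluated in the paper.

def evaluate(model_name: str, dataset: list[dict]) -> float:
    """Fraction of items whose zero-shot answer matches the gold label."""
    hits = sum(
        query_model(build_prompt(item), model=model_name).strip().lower()
        == item["gold_label"]
        for item in dataset
    )
    return hits / len(dataset)

dataset = [item]  # in practice: the full set of benchmark items
for name in ("gpt-4o", "claude-3-5-sonnet", "a-japanese-llm"):
    print(f"{name}: {evaluate(name, dataset):.2%}")
```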
👥 Authors

Ha-Thanh Nguyen
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Chaoran Liu
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Hirokazu Kiyomaru
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Koichi Takeda
National Institute of Informatics
Natural Language Processing

Yusuke Miyao
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Maki Matsuda
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Yusuke Oda
NAIST
Natural Language Processing, Machine Translation, Software Engineering

Pontus Stenetorp
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Qianying Liu
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Su Myat Noe
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Hideyuki Tachibana
Research and Development Center for Large Language Models, NII, Tokyo, Japan

Kouta Nakayama
RIKEN
Natural Language Processing

Sadao Kurohashi
Research and Development Center for Large Language Models, NII, Tokyo, Japan